What is Reinforcement Learning?
Reinforcement Learning (RL) is a subset of machine learning that focuses on how an agent interacts with its environment to achieve a specific goal by taking a sequence of actions that maximizes a numerical reward signal. RL involves balancing exploration and exploitation: exploration refers to the agent choosing an action it has not tried before, while exploitation involves repeating a known action that has previously led to a positive outcome.
Example applications of Reinforcement Learning:
Trading: RL can be used to make trading decisions by training an agent to choose the best actions based on historical market data and other relevant factors. The agent, in this case, would be the trading algorithm while the environment would include the market conditions and other traders. RL can help in finding the optimal trading strategies and predicting the market trends.
Video Games: Reinforcement learning can also be used to create intelligent game playing agents. The agent in this case would learn to play a game by exploring different strategies, receiving rewards and penalties based on the outcome of its actions and adapting its behavior accordingly. For example, a RL-based agent playing chess would learn to make the best moves by trying out different strategies, receiving rewards for winning and penalties for losing, and updating its decision-making policy based on this experience.
Robotics: Reinforcement learning can be used to control the behavior of robots. The agent can learn to perform tasks by exploring different actions, receiving rewards or penalties based on the outcome and updating its behavior accordingly. For example, a RL-based robot can be trained to navigate through a maze or pick and place objects.
Control Systems: Reinforcement learning can be applied to control systems to optimize their performance. The agent can learn to control the system by trying out different actions, receiving rewards or penalties based on the system's performance, and updating its behavior accordingly. For example, a RL-based control system can be used to optimize the energy consumption of a building.
📙 Main Heading
📖 Subheading
🤓 Research/Discussion
🤖 RL code/modeling/training
🔬 RL model evaluation/analysis
⚙️ Config/utility code (Found throughout notebook)
| Description | Headings |
|---|---|
| 📙 | Imports & Configuration |
| 📖 | Libraries |
| 📖🤓 | Environment Background Research |
| 📙🤓 | Deep Q-learning |
| 📖🤖 | Deep Q-learning Network - (DQN) |
| 📖🤓🤖 | Deep Q-learning Network Agent |
| 📖🤓🤖 | Experience Replay |
| 📖🤓 | Epsilon Greedy |
| 📖🤖 | Deep Q-learning Network Training |
| 📖🔬 | Evaluate DQN Performance |
| 📖🤓 | Double Deep Q-Learning Network - (DDQN) |
| 📖🤖 | Double Deep Q-Learning Network Modelling |
| 📖🤖 | Double Deep Q-Learning Network Training |
| 📖🔬 | Evaluate DDQN Performance |
| 📙🤓 | Actor and Critic /w PPO |
| 📖🤖 | Actor and Critic Network Modelling |
| 📖🤖 | Training Algorithm Code |
| 📖🤖 | Training /w Actor & Critic (PPO) |
| 📖🔬 | Evaluate Actor & Critic /w PPO Performance |
| 📙🤓 | DQN + Prioritized Experience Replay - Improving Our Best Candidate |
| 📖🤖 | Adding PER into our DQN Agent |
| 📖🤖 | Training DQN + PER |
| 📖🔬 | Evaluate DQN + PER Performance |
| 📙🤓 | Hyperparameter Tuning - 5 Hyperparameters |
| 📖🤖 | Running Hyperparameter Tuner |
| 📖🔬 | Evaluate Hyperparameter Tuned DQN + PER |
| 📙 | Final Evaluation - Objective Testing |
| 📖🤖 | Training All Models (1000 Episodes) |
| 📖🔬 | Testing All Models (500 Episodes) |
| 📖🔬 | Final Evaluation Of Test Results |
# !pip install swig
# !pip install gym[box2d]
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as distributions
import torch.optim as optim
import base64, io, os
from copy import deepcopy
from tqdm.auto import tqdm
from ipywidgets import Output, GridspecLayout, Layout
from IPython.display import clear_output
from IPython import display
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.lines import Line2D
import numpy as np
import pandas as pd
import random, json, itertools, time
from collections import deque, namedtuple
# For visualization
import gym
from gym.wrappers.monitoring import video_recorder
from IPython.display import HTML
import glob
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
OpenAI Gym
OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. Its library contains a variety of reinforcement learning environments, from classic control tasks to physics-based simulations. These environments share a common, easy-to-use interface, allowing researchers and developers to train and compare agents using different algorithms.
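To make that shared interface concrete, here is a minimal sketch of the standard agent-environment loop. The `ToyEnv` class below is a hypothetical stand-in, not a real Gym environment; it only mimics the `reset`/`step` API of gym >= 0.26, where `step` returns `(observation, reward, terminated, truncated, info)`:

```python
import random

class ToyEnv:
    """Hypothetical stand-in mimicking the Gym reset/step interface."""
    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.t = 0

    def reset(self, seed=None):
        if seed is not None:
            random.seed(seed)
        self.t = 0
        return 0.0, {}  # (observation, info), as in gym >= 0.26

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0   # illustrative reward rule
        terminated = False                     # the toy task never "succeeds"
        truncated = self.t >= self.max_steps   # episode length limit reached
        return float(self.t), reward, terminated, truncated, {}

# The standard interaction loop, identical in shape for any Gym-style environment
env = ToyEnv()
state, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    action = random.choice([0, 1])  # a trained agent would choose here
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
```

The training code later in this notebook follows exactly this loop, only with `LunarLander-v2` as the environment and the DQN agent picking actions.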
OpenAI Gym Environment: LunarLander-v2
Lunar Lander is one of OpenAI Gym's environments, where the agent is a lunar lander that tries to land on a landing pad at coordinates (0,0). These coordinates are the first two numbers in the state vector. The environment was created by Oleg Klimov.
There are a total of 4 discrete actions that the lander can take:
- 0: do nothing
- 1: fire left orientation engine
- 2: fire main engine
- 3: fire right orientation engine

There are a total of 8 observation values: the lander's x and y coordinates, its x and y linear velocities, its angle, its angular velocity, and two booleans indicating whether each leg is in contact with the ground.

Reward system of Lunar Lander: reward increases the closer the lander is to the pad and the slower it moves, and decreases the more it tilts. Each leg in contact with the ground gives +10 points, firing the side engine costs 0.03 points per frame, firing the main engine costs 0.3 points per frame, and the episode ends with an additional -100 for crashing or +100 for coming to rest.
From the above, we can conclude that the best episode is one where the lander lands (+10 for each leg contact) and comes to rest (+100 for rest) at the center of the landing pad at zero speed (+120 for landing), using the fewest engine firings possible (+220 - (count_main_engine_fire * 0.3 + count_side_engine_fire * 0.03)).
"enable_wind" is a parameter in the Lunar Lander v2 environment of OpenAI Gym. It determines whether wind is included in the simulation or not. When enable_wind is set to True, wind is included as a disturbance force acting on the lunar lander, adding an extra layer of difficulty to the task of landing. For our assignment here, we will be enabling the wind in the environment.
[References: OpenAI - LunarLander Documentation]
Note that env.seed() is deprecated in the latest version of gym; to set the seed, simply call np.random.seed().
env = gym.make('LunarLander-v2',enable_wind=True)
np.random.seed(0)
print('State shape: ', env.observation_space.shape)
print('Number of actions: ', env.action_space.n)
State shape:  (8,)
Number of actions:  4
Q Learning
Q-Learning builds a Q-table of state-action values, with dimensions equal to the number of states and the number of actions. This table maps each state-action pair to a Q-value. The disadvantage of this method is that in real-world scenarios the table can become very large and difficult to manage.
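As a sketch of what the Q-table looks like in practice, the snippet below applies the tabular Q-learning update rule Q(s,a) ← Q(s,a) + α(r + γ max_a' Q(s',a') - Q(s,a)) to a tiny made-up 3-state, 2-action problem; all numbers are illustrative and unrelated to Lunar Lander:

```python
import numpy as np

n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))  # the Q-table: one value per state-action pair
alpha, gamma = 0.5, 0.9              # learning rate and discount factor

# One illustrative transition: in state 0, action 1 yields reward 1.0, next state 2
s, a, r, s_next = 0, 1, 1.0, 2

# Q-learning update: move Q[s, a] part-way toward the bootstrapped target
td_target = r + gamma * np.max(Q[s_next])
Q[s, a] += alpha * (td_target - Q[s, a])
print(Q[0, 1])  # 0.5: half-way from 0 toward the target of 1.0
```

Deep Q-Learning, introduced below, replaces the `Q` array with a neural network so the table never has to be stored explicitly.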
Q Function
The Q-function in reinforcement learning represents the expected cumulative discounted reward of taking a specific action in a given state and following a fixed policy thereafter. The formula for the Q-function is given by:
$$Q(s, a) = E[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid s_t = s, a_t = a]$$

where:
| Symbol | Meaning |
|---|---|
| $s_t$ | represents the state at time t |
| $a_t$ | represents the action taken at time t |
| $R_t$ | represents the reward at time t |
| $\gamma$ | represents the discount factor, which determines the importance of future rewards relative to immediate rewards (0 < $\gamma$ $\leq$ 1) |
The Q-function estimates the expected cumulative discounted reward of taking action "a" in state "s" and following a fixed policy thereafter. It is used to determine the optimal policy, which is the policy that selects the action that maximizes the Q-value for each state.
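As a quick worked example of the expectation above, suppose an episode (hypothetical numbers) yields rewards [1, 0, 2] from time t onward with γ = 0.9. The discounted return is 1 + 0.9·0 + 0.81·2 = 2.62:

```python
def discounted_return(rewards, gamma=0.9):
    # Sum of gamma^k * R_{t+k} over the reward sequence
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # ≈ 2.62
```

The Q-function is the expected value of exactly this quantity over all episodes that start with state s and action a.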
Deep Q Learning
To address the Q-table's scalability issue, a Q-function can be used instead, achieving the same mapping from state-action pairs to Q-values. Since neural networks are excellent at modelling complex functions, Deep Q-Learning (DQN) uses a neural network to approximate this Q-function, which is also referred to as the state-action value function.
The Q-network is trained to produce optimal state-action values. The current Q-network is a fairly standard architecture containing a few linear layers. The DQN architecture contains two neural networks: an online network and a target network.
Target Network
The target network has the exact same architecture as the online network. It is not trained and only outputs predictions; these outputs are referred to as target Q-values.
The reason for this second network is to stabilize the training process. During training, the online network's estimates can change rapidly from step to step, so bootstrapping from its own estimates makes the targets, and hence the training process, unstable. The target network provides a more stable target to learn from: the online network is trained toward the target network's Q-value estimates rather than its own, which allows for a more stable training process.
The target network is only updated periodically with the parameters of the online network. This is referred to as a soft update, as it is not a complete copy but rather a weighted interpolation.
[References: Reinforcement Learning Explained Visually (Part 5): Deep Q Networks, step-by-step]
class QNetwork(nn.Module):
def __init__(self, state_dim: int, action_dim: int, hidden_size=64):
# Initialize the parameters and model
super(QNetwork, self).__init__()
self.fc1 = nn.Linear(state_dim, hidden_size) # First fully connected layer with state_dim inputs
self.fc2 = nn.Linear(hidden_size, hidden_size) # Second fully connected layer
self.fc3 = nn.Linear(hidden_size, action_dim) # Third fully connected layer with action_dim outputs
def forward(self, state):
# Build the network that maps state -> action values
x = self.fc1(state)
x = F.relu(x)
x = self.fc2(x)
x = F.relu(x)
return self.fc3(x)
Agent
The agent is the entity that takes actions in the environment and receives feedback in the form of rewards. The agent's behavior is determined by the Q-network. The DQN algorithm trains this network to output the expected cumulative reward for taking a specific action in a given state. The agent selects actions based on the outputs of the Q-network, and the network is updated based on the observed rewards.
Mean Squared Error
Mean squared error (MSE) will be used to compute the differences in the predicted reward and observed reward. The MSE loss function measures the average squared difference between the predicted values and the actual values, which provides a measure of the error in the predictions. Minimizing the MSE loss function with gradient descent leads to improved predictions, which leads to better performance in the reinforcement learning task.
Bellman Equation
The Bellman equation defines the relationship between the expected cumulative reward for a given state and action, and the expected cumulative reward for the next state that results from that action. It provides a way to recursively compute the optimal action-value function, which maps states and actions to expected cumulative rewards.
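In the notation of the Q-function above, the Bellman equation can be written as:

$$Q(s, a) = E[R_t + \gamma \max_{a'} Q(s_{t+1}, a') \mid s_t = s, a_t = a]$$

The line `q_targets = rewards + gamma * q_targets_next * (1 - dones)` in the `learn` method below is the sampled form of this right-hand side, with the `(1 - dones)` factor zeroing out the bootstrap term for terminal states.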
class Agent:
# Interacts with and learns from the environment.
def __init__(self, state_dim, action_dim, hidden_dim, network):
'''
Initialize an Agent object.
Parameters
----------
state_dim (int): Dimension of each state
action_dim (int): Dimension of each action
'''
self.state_dim = state_dim
self.action_dim = action_dim
# Q-Network
self.qnetwork_online = network(state_dim, action_dim, hidden_dim).to(device)
self.qnetwork_target = network(state_dim, action_dim, hidden_dim).to(device)
self.optimizer = optim.Adam(self.qnetwork_online.parameters(), lr=LR)
# Replay memory
self.memory = ReplayBuffer(action_dim, BUFFER_SIZE, BATCH_SIZE)
# Initialize time step (for updating every UPDATE_EVERY steps)
self.t_step = 0
def step(self, state, action, reward, next_state, done):
# Saves the experience in replay memory, and learns from it in specified intervals."
self.memory.add(state, action, reward, next_state, done)
self.t_step = (self.t_step + 1) % UPDATE_EVERY
if self.t_step == 0:
if len(self.memory) > BATCH_SIZE:
experiences = self.memory.sample()
self.learn(experiences, GAMMA)
def act(self, state, eps=0.):
'''
Returns actions for given state as per current policy.
Parameters
----------
state (array_like): Current state
eps (float): Epsilon, for epsilon-greedy action selection
Returns
-------
int: The selected action
'''
state = torch.from_numpy(state).float().unsqueeze(0).to(device)
self.qnetwork_online.eval()
with torch.no_grad():
action_values = self.qnetwork_online(state)
self.qnetwork_online.train()
# Epsilon-greedy action selection
if random.random() > eps:
return np.argmax(action_values.cpu().data.numpy())
else:
return random.choice(np.arange(self.action_dim))
def learn(self, experiences, gamma):
'''
Update value parameters using given batch of experience tuples.
Parameters
----------
experiences (Tuple[torch.Variable]): Tuple of (s, a, r, s', done) tuples
gamma (float): Discount factor
'''
# Obtain random minibatch of tuples from D
states, actions, rewards, next_states, dones = experiences
# Compute and minimize the loss
# Extract next maximum estimated value from target network
q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
# Calculate target value from bellman equation
q_targets = rewards + gamma * q_targets_next * (1 - dones)
# Calculate expected value from local network
q_expected = self.qnetwork_online(states).gather(1, actions)
# Loss calculation (we used Mean squared error)
loss = F.mse_loss(q_expected, q_targets)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
# Update target network
self.soft_update(self.qnetwork_online, self.qnetwork_target, TAU)
def soft_update(self, local_model, target_model, tau):
'''
Soft update model parameters.
θ_target = τ*θ_local + (1 - τ)*θ_target
Parameters:
----------
local_model (PyTorch model): weights will be copied from
target_model (PyTorch model): weights will be copied to
tau (float): interpolation parameter
'''
# Copy weights of the local (online) network to the target network
for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)
The idea behind experience replay is to store the agent's experiences, represented as tuples of (state, action, reward, next_state), in a memory buffer, and then randomly sample these experiences to train the Q-network. Random selection is performed to ensure that the batch is shuffled and contains diversity from older and newer samples.
Neural networks are typically trained on batches of data. If we were to train on a single sample each iteration, the resulting gradients would have too much variance and the network weights would never converge. Besides stabilizing the training process, this technique also allows rare, infrequent experiences to be replayed: by storing them in memory, the agent can revisit them multiple times and learn from them more effectively.
class ReplayBuffer:
# A fixed-size container to store experience tuples.
def __init__(self, action_dim, buffer_size, batch_size):
'''
Initialize the buffer object.
Parameters
----------
action_dim (int): dimension of each action
buffer_size (int): maximum size of the buffer
batch_size (int): size of each training batch
'''
self.action_dim = action_dim
self.memory = deque(maxlen=buffer_size)
self.batch_size = batch_size
self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
def add(self, state, action, reward, next_state, done):
# Add a new experience to the buffer.
e = self.experience(state, action, reward, next_state, done)
self.memory.append(e)
def sample(self):
# Select a random batch of experiences from the buffer.
experiences = random.sample(self.memory, k=self.batch_size)
states = torch.from_numpy(np.vstack([e.state for e in experiences if e is not None])).float().to(device)
actions = torch.from_numpy(np.vstack([e.action for e in experiences if e is not None])).long().to(device)
rewards = torch.from_numpy(np.vstack([e.reward for e in experiences if e is not None])).float().to(device)
next_states = torch.from_numpy(np.vstack([e.next_state for e in experiences if e is not None])).float().to(device)
dones = torch.from_numpy(np.vstack([e.done for e in experiences if e is not None]).astype(np.uint8)).float().to(device)
return (states, actions, rewards, next_states, dones)
def __len__(self):
# Return the current size of the buffer.
return len(self.memory)
Videos of the agent in an environment will be saved at specified intervals of episodes.
def show_video(file_name, width = 400):
    mp4 = 'video/{}.mp4'.format(file_name)
    if os.path.exists(mp4):
        video = io.open(mp4, 'r+b').read()
        encoded = base64.b64encode(video)
display.display(HTML(data='''<video alt="test" autoplay
loop controls style="height: {}px;">
<source src="data:video/mp4;base64,{}" type="video/mp4" />
</video>'''.format(width, encoded.decode('ascii'))))
else:
print("Could not find video")
def save_video(agent, file_name, model_ckpt= 'checkpoint_best.pth', max_t=1000,seed = 0):
env = gym.make('LunarLander-v2',enable_wind=True, render_mode="rgb_array")
vid = video_recorder.VideoRecorder(env, path="video/{}.mp4".format(file_name))
agent.qnetwork_online.load_state_dict(torch.load('./models/' + model_ckpt))
state = env.reset(seed = seed)[0]
done = False
t = 0
rewards = 0
while not done and t != max_t:
t += 1
frame = env.render()
vid.capture_frame()
action = agent.act(state)
state, reward, done, _, _ = env.step(action)
rewards += reward
env.close()
return rewards
Epsilon-greedy is a method used to balance exploration and exploitation in the Q-learning process. The Q-value function estimates the value of taking a certain action in a given state. In Q-learning, the agent selects the action with the highest Q-value, known as the greedy action. However, this approach can leave the agent stuck in a suboptimal solution if it only ever selects the greedy action.
The epsilon-greedy algorithm addresses this issue by introducing a probability epsilon of selecting a random action instead of the greedy action. This allows the agent to explore new actions and states, which can lead to finding better solutions.
In the algorithm, max_epsilon and min_epsilon are defined. Over the course of training, epsilon decays over time until it reaches min_epsilon. Decreasing epsilon encourages the algorithm to rely more on the values it has learned and less on random exploration.
n_episodes = 500
min_epsilon = 0.01
max_epsilon = 1.0
decay_rate = 1-0.995
# initialize epsilon values for greedy search
epsilon_array = np.zeros((n_episodes))
for i in range(n_episodes):
epsilon = min_epsilon + (max_epsilon-min_epsilon)*np.exp(-decay_rate*i)
epsilon_array[i] = epsilon
plt.plot(epsilon_array)
plt.show()
The value of epsilon decreases over time, allowing the agent to gradually shift from exploration to exploitation.
The agent is trained for a maximum number of episodes (n_episodes), where each episode can run for a maximum number of time steps (max_t). max_t controls how many time steps can be taken in each episode; the smaller it is, the fewer steps the agent has available to solve the environment within that episode.
The agent selects actions using an epsilon-greedy policy, where the value of epsilon starts from eps_start and gradually decreases to eps_end during training.
This function tracks and displays the average and current scores, episode lengths, and success and landing rates over the past 100 episodes (Simple Moving Average 100).
The current state of the agent's Q-network is saved every display_every episodes, and a video of the agent's performance is recorded and displayed every 2*display_every episodes. The best-performing agent is saved and its scores are displayed once the average score of the past 100 episodes reaches 200 or higher.
Code to train our agent with Deep Q-Learning Networks (includes DDQN) ⚙️
def train_agent(n_episodes: int=3000, max_t: int=1000, eps_start: float=1.0,
eps_end: float=0.01, eps_decay: float=0.995, display_every: int=150, model_name='DQN',
video_filepath='LunarLander_training'):
'''
Train a Network agent
Parameters:
n_episodes (int): Maximum number of episodes for training
max_t (int): Maximum number of timesteps per episode
eps_start (float): Initial value of epsilon for epsilon-greedy action selection
eps_end (float): Minimum value of epsilon
eps_decay (float): Factor to decrease epsilon per episode
'''
scores = [] # list containing scores from each episode
scores_SMA100 = []
scores_window = deque(maxlen=100) # last 100 scores
eps = eps_start # initialize epsilon
time_taken = []
time_taken_window = deque(maxlen=100)
success_rate = deque(maxlen=100)
landing_rate = deque(maxlen=100)
success_rate_SMA100 = []
landing_rate_SMA100 = []
for i_episode in tqdm(range(1, n_episodes+1)):
state = env.reset()[0]
score = 0
for t in range(max_t):
action = agent.act(state, eps)
next_state, reward, done, _, _ = env.step(action)
agent.step(state, action, reward, next_state, done)
state = next_state
score += reward
if done:
if score >= 200:
success_rate.append(1)
landing_rate.append(1)
elif score >= 120:
success_rate.append(0)
landing_rate.append(1)
else:
success_rate.append(0)
landing_rate.append(0)
break
scores_window.append(score) # save most recent score
scores_SMA100.append(np.mean(scores_window))
scores.append(score) # save most recent score
time_taken.append(t)
time_taken_window.append(t)
landing_rate_SMA100.append(landing_rate.count(1))
success_rate_SMA100.append(success_rate.count(1))
eps = max(eps_end, eps_decay*eps) # decrease epsilon
if i_episode % display_every == 0:
# SMA100: Average of past 100 period (Simple Moving Average)
print(f'\rEpisode {i_episode}\tAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.3f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
            torch.save(agent.qnetwork_online.state_dict(), f'./models/{model_name+str(i_episode)}_train.pth')
if i_episode % (display_every*2) == 0:
save_video(agent, video_filepath, f'{model_name+str(i_episode)}_train.pth')
elif i_episode % (display_every*2+1) == 0:
show_video(video_filepath, 200)
if np.mean(scores_window)>=200.0:
print('\nEnvironment solved in {:d} episodes!'.format(i_episode))
print(f'\rAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.3f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
            torch.save(agent.qnetwork_online.state_dict(), f'./models/{model_name}_best.pth')
break
return {
'scores': scores, 'scores_SMA100': scores_SMA100,
'scores_window': scores_window, 'time_taken': time_taken,
'time_taken_window': time_taken_window, 'success_rate': success_rate,
'landing_rate': landing_rate, 'landing_rate_SMA100': landing_rate_SMA100,
'success_rate_SMA100': success_rate_SMA100
}
BUFFER_SIZE = int(1e5) # replay buffer size
BATCH_SIZE = 64 # minibatch size
GAMMA = 0.99 # discount factor
TAU = 1e-3 # for soft update of target parameters
LR = 0.0005 # learning rate
UPDATE_EVERY = 4 # how often to update the network
Metrics Legend:
# Reset seed - Use same seed for all experiments (objective comparison)
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
# DQN
agent = Agent(state_dim=8, action_dim=4, hidden_dim=64, network=QNetwork)
results_DQN = train_agent(display_every=200, max_t=1000, video_filepath='DQN')
Episode 200  Avg Score (SMA100): -219.803  Current Score: -227
Avg Episode Length (SMA100): 147.04  Current Episode Length: 171.000
Landing Rate: 0% | Success Rate: 0%

Episode 400  Avg Score (SMA100): -135.108  Current Score: -1
Avg Episode Length (SMA100): 446.32  Current Episode Length: 243.000
Landing Rate: 0% | Success Rate: 0%

Moviepy - Building video video/DQN.mp4.
Moviepy - Writing video video/DQN.mp4
Moviepy - Done ! Moviepy - video ready video/DQN.mp4
Episode 600  Avg Score (SMA100): -150.778  Current Score: -173
Avg Episode Length (SMA100): 424.03  Current Episode Length: 146.000
Landing Rate: 0% | Success Rate: 0%

Episode 800  Avg Score (SMA100): -153.087  Current Score: -64
Avg Episode Length (SMA100): 594.59  Current Episode Length: 999.000
Landing Rate: 1% | Success Rate: 0%

Moviepy - Building video video/DQN.mp4.
Moviepy - Writing video video/DQN.mp4
Moviepy - Done ! Moviepy - video ready video/DQN.mp4
Episode 1000  Avg Score (SMA100): -83.026  Current Score: -70
Avg Episode Length (SMA100): 868.31  Current Episode Length: 999.000
Landing Rate: 2% | Success Rate: 0%

Episode 1200  Avg Score (SMA100): -23.869  Current Score: 200
Avg Episode Length (SMA100): 759.1  Current Episode Length: 676.000
Landing Rate: 31% | Success Rate: 18%

Moviepy - Building video video/DQN.mp4.
Moviepy - Writing video video/DQN.mp4
Moviepy - Done ! Moviepy - video ready video/DQN.mp4
Episode 1400  Avg Score (SMA100): 15.677  Current Score: -113
Avg Episode Length (SMA100): 792.01  Current Episode Length: 999.000
Landing Rate: 77% | Success Rate: 31%

Episode 1600  Avg Score (SMA100): 36.667  Current Score: 70
Avg Episode Length (SMA100): 678.93  Current Episode Length: 999.000
Landing Rate: 69% | Success Rate: 33%

Moviepy - Building video video/DQN.mp4.
Moviepy - Writing video video/DQN.mp4
Moviepy - Done ! Moviepy - video ready video/DQN.mp4
Episode 1800  Avg Score (SMA100): 128.531  Current Score: -110
Avg Episode Length (SMA100): 548.12  Current Episode Length: 232.000
Landing Rate: 80% | Success Rate: 63%

Environment solved in 1951 episodes!
Avg Score (SMA100): 201.090  Current Score: 274
Avg Episode Length (SMA100): 430.74  Current Episode Length: 655.000
Landing Rate: 89% | Success Rate: 72%
Minor code ⚙️
def saveJSON(data, filename):
    # Convert any deque values to plain lists so they can be serialized
    for key, value in data.copy().items():
        if isinstance(value, deque):
            data[key] = list(value)
    # Save dictionary as .json
    with open(filename, 'w') as handle:
        json.dump(data, handle)
def loadJSON(filename):
# Load .json
with open(filename, 'r') as handle:
result = json.load(handle)
return result
Utility code to plot graphs ⚙️
# Label line graph points
def labelMaxMin(ax, history, field):
# Find min
if field in ['time_taken', 'time_taken_window']:
minmax = np.min(history[field])
legend_value = f'Min: {minmax:.2f}'
y = minmax - 15
else: # Find max
minmax = np.max(history[field])
legend_value = f'Max {minmax:.2f}'
y = minmax * 1.05
# Label
epoch = np.where(history[field] == minmax)[0]
if len(epoch) > 1:
for elem in epoch:
ax.plot(elem, minmax, 'ro')
epoch = epoch[-1]
ax.annotate(f'{minmax:.4f}', xy=(epoch, y))
ax.plot(epoch, minmax, 'ro')
# Create legend for marker
legend_element = Line2D([0], [0], marker='o', color='r', label=legend_value)
return legend_element
def highlightAvg(ax, history, field, idx):
# Compute mean
mean_value = np.mean(history[field])
if idx == 0:
ax.axhline(y=mean_value, linestyle='--', color='blue')
else:
ax.axhline(y=mean_value, linestyle='--', color='orange')
# Label mean value
epochs = len(history[field])
text = f'{mean_value:.4f}'
range_value = max(history[field]) - min(history[field])
if mean_value > 100:
ax.annotate(text, xy=(-5, mean_value + (range_value/100)))
elif mean_value >= 0 and mean_value < 1:
ax.annotate(text, xy=(-5, mean_value))
elif mean_value >= 0:
ax.annotate(text, xy=(-5, mean_value * (range_value/100)))
else:
ax.annotate(text, xy=(-5, mean_value / (range_value/100)))
return mean_value
def makeChart(ax, history, field, idx=0):
ax.plot(history[field])
# Details
mean_value = highlightAvg(ax, history, field, idx) # Highlight mean value
minmax_legend = labelMaxMin(ax, history, field) # Label min/max
# Display legend
if idx == 0:
metric_legend = Line2D([0], [0], lw=2, color='blue', label=f'{field}')
mean_legend = Line2D([0], [0], lw=2, color='blue', linestyle='dotted', label=f'Average: {mean_value:.2f}')
else:
metric_legend = Line2D([0], [0], lw=2, color='orange', label=f'{field}')
mean_legend = Line2D([0], [0], lw=2, color='orange', linestyle='dotted', label=f'Average: {mean_value:.2f}')
legend_elements = [metric_legend, minmax_legend, mean_legend]
return legend_elements
# Loss and accuracy plots
def plotResult(history, fields):
fig, ax = plt.subplots(1, 2, figsize=(18, 8))
for i in range(len(fields)):
# Check if its a nested list e.g. ([['success_rate', 'landing_rate'], 'others'])
if len(fields[i]) > 1:
# Plot chart with two fields
legend_element = []
for field in fields[i]:
idx = fields[i].index(field)
legend = makeChart(ax[i], history, field, idx)
legend_element.append(legend)
# Label legend
flattened_list = []
for sublist in legend_element:
for item in sublist:
flattened_list.append(item)
ax[i].legend(handles=flattened_list)
# Label
            if 'rate' in fields[i][0] or 'rate' in fields[i][1]:
ax[i].set_ylabel('percentage')
if len(history[field]) == 100:
ax[i].set_xlabel('Past 100')
else:
ax[i].set_xlabel('Episodes')
            ax[i].set_title(f'{fields[i][0].capitalize()} and {fields[i][1].capitalize()}')
else: # Plot chart with one field
field = fields[i][0]
legend = makeChart(ax[i], history, field)
ax[i].legend(handles=legend)
ax[i].set_ylabel(field)
if len(history[field]) == 100:
ax[i].set_xlabel('Past 100')
else:
ax[i].set_xlabel('Episodes')
ax[i].set_title(f'{field.capitalize()}')
plt.tight_layout()
plt.show()
# Final result
flattened_list = []
for sublist in fields:
for item in sublist:
flattened_list.append(item)
print(f'Past {len(history[flattened_list[0]])} Episodes')
print('====================================')
for field in flattened_list:
print(f'Final {field}: {history[field][-1]:.2f}')
saveJSON(results_DQN, 'dict_DQN.json')
Video example of reward > 200 (lands successfully + lands between the flags + lands relatively quickly)
show_video('DQN')
# Plot result
dict_DQN = loadJSON('dict_DQN.json')
sns.set_style("whitegrid")
plotResult(dict_DQN, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 1951 Episodes
====================================
Final success_rate_SMA100: 72.00
Final landing_rate_SMA100: 89.00
Final scores_SMA100: 201.09
Observation:
We observed that the agent began to learn how to land with both feets around Episode 700 and successfully landed within the flags by Episode 1000. The landing rate improved rapidly, outpacing the success rate around Episode 1200. In this context, variance refers to the difference between landing and success rates. High variance indicates that the agent has mastered landing techniques, but is not as efficient, as it may waste a significant amount of reward. For example, if the agent lands 90% of the time, but its success rate remains low (less than 200 rewards), its efficiency would be questionable. Conversely, low variance implies that the landing and success rates are closely aligned, suggesting that the agent is similar effectiveness and efficient.
We can see that after Episode 1200 the variance increases before decreasing, implying that the agent starts to land efficiently after first learning how to land effectively.
Double Deep Q-Learning Network (DDQN) is a variant of the deep Q-network (DQN) algorithm. DDQN addresses the problem of overestimation of action values that can occur in standard DQN, leading to suboptimal policies.
In DQN, the action-value function is updated using the maximum estimated Q-value of the next state obtained from the target network. This can lead to overestimation of the Q-values.
In DDQN, the action-value function is updated using two networks, the local network to select the action and the target network to estimate the expected future reward for that action. This separation of action selection and value estimation reduces the potential for overestimation of Q-values.
[References: Double Deep Q Networks]
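The decoupling of action selection from value estimation can be sketched numerically. In this hedged numpy example (function name and toy values are my own, not from the notebook), the online network picks the greedy next action and the target network evaluates it:

```python
import numpy as np

def double_dqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Double DQN target: select the greedy next action with the online network,
    then evaluate that action with the target network."""
    # greedy actions under the online network
    a_star = np.argmax(q_online_next, axis=1)
    # evaluate those actions with the target network
    q_eval = q_target_next[np.arange(len(a_star)), a_star]
    return rewards + gamma * q_eval * (1.0 - dones)

# toy batch of 2 transitions with 3 actions
q_online_next = np.array([[1.0, 5.0, 2.0],
                          [0.5, 0.1, 0.2]])
q_target_next = np.array([[1.2, 3.0, 2.5],
                          [0.4, 0.0, 0.3]])
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
targets = double_dqn_targets(q_online_next, q_target_next, rewards, dones)
# first transition: 1.0 + 0.99 * 3.0 (target net's value of the online argmax, a=1)
# second transition: terminal, so the target is just the reward (0.0)
```

Standard DQN would instead take `q_target_next.max(axis=1)` directly, which tends to overestimate when the target network's errors are correlated with its own argmax.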
BUFFER_SIZE = int(1e5) # replay buffer size
BATCH_SIZE = 64 # minibatch size
GAMMA = 0.99 # discount factor
TAU = 1e-3 # for soft update of target parameters
LR = 0.0005 # learning rate
UPDATE_EVERY = 4 # how often to update the network
class DDQN(nn.Module):
"""
Deep Reinforcement Learning with Double Q-Learning by Hasselt et al. (2016)
Double Deep Q-Network Model Graph
The neural network is a function from state space $R^{n_states}$ to action space $R^{n_actions}$
"""
def __init__(self, n_states, n_actions, hidden_size=32):
super(DDQN, self).__init__()
self.n_actions = n_actions
self.hidden_size = hidden_size
# hidden representation
self.dense_layer_1 = nn.Linear(n_states, hidden_size)
self.dense_layer_2 = nn.Linear(hidden_size, hidden_size)
self.dense_layer_3 = nn.Linear(hidden_size, hidden_size)
# V(s)
self.v_layer_1 = nn.Linear(hidden_size, hidden_size)
self.v_layer_2 = nn.Linear(hidden_size, hidden_size // 2)
self.v_layer_3 = nn.Linear(hidden_size // 2, 1)
# A(s, a)
self.a_layer_1 = nn.Linear(hidden_size, hidden_size)
self.a_layer_2 = nn.Linear(hidden_size, hidden_size // 2)
self.a_layer_3 = nn.Linear(hidden_size // 2, n_actions)
def forward(self, state):
x = F.relu(self.dense_layer_1(state))
x = F.relu(self.dense_layer_2(x))
x = F.relu(self.dense_layer_3(x))
v = F.relu(self.v_layer_1(x))
v = F.relu(self.v_layer_2(v))
v = self.v_layer_3(v)
a = F.relu(self.a_layer_1(x))
a = F.relu(self.a_layer_2(a))
a = self.a_layer_3(a)
return v + a - a.mean(dim=-1, keepdim=True).expand(-1, self.n_actions)
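The forward pass above aggregates a state-value stream and an advantage stream, subtracting the mean advantage so that the V/A decomposition is identifiable. A minimal numpy sketch of just this aggregation step (function name and toy values are illustrative, separate from the notebook's torch code):

```python
import numpy as np

def dueling_q(v, a):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    Subtracting the mean advantage pins down V and A uniquely."""
    return v + a - a.mean(axis=-1, keepdims=True)

v = np.array([[2.0]])             # state value, batch of 1
a = np.array([[1.0, -1.0, 0.0]])  # advantages for 3 actions
q = dueling_q(v, a)
# mean advantage is 0, so Q = [[3.0, 1.0, 2.0]]
```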
Metrics Legend:
# Reset seed - Use same seed for all experiments (objective comparison)
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
# DDQN
agent = Agent(state_dim=8, action_dim=4, hidden_dim=64, network=DDQN)
results_DDQN = train_agent(display_every=200, max_t=1000, video_filepath='DDQN')
Episode 200 Avg Score (SMA100): -282.159 Current Score: -202 Avg Episode Length (SMA100): 122.19 Current Episode Length: 191.000 Landing Rate: 0% | Success Rate: 0%
Episode 400 Avg Score (SMA100): -96.351 Current Score: 182 Avg Episode Length (SMA100): 395.26 Current Episode Length: 753.000 Landing Rate: 2% | Success Rate: 0%
Episode 600 Avg Score (SMA100): -142.619 Current Score: -159 Avg Episode Length (SMA100): 403.16 Current Episode Length: 253.000 Landing Rate: 4% | Success Rate: 1%
Episode 800 Avg Score (SMA100): -143.774 Current Score: -269 Avg Episode Length (SMA100): 445.92 Current Episode Length: 800.000 Landing Rate: 4% | Success Rate: 0%
Episode 1000 Avg Score (SMA100): -175.103 Current Score: -205 Avg Episode Length (SMA100): 484.74 Current Episode Length: 519.000 Landing Rate: 0% | Success Rate: 0%
Episode 1200 Avg Score (SMA100): -92.123 Current Score: -170 Avg Episode Length (SMA100): 594.58 Current Episode Length: 682.000 Landing Rate: 17% | Success Rate: 3%
Episode 1400 Avg Score (SMA100): 93.774 Current Score: -15 Avg Episode Length (SMA100): 614.15 Current Episode Length: 91.000 Landing Rate: 65% | Success Rate: 30%
Episode 1600 Avg Score (SMA100): 79.704 Current Score: -105 Avg Episode Length (SMA100): 551.25 Current Episode Length: 706.000 Landing Rate: 56% | Success Rate: 30%
Episode 1800 Avg Score (SMA100): 42.873 Current Score: -23 Avg Episode Length (SMA100): 538.49 Current Episode Length: 999.000 Landing Rate: 46% | Success Rate: 21%
Episode 2000 Avg Score (SMA100): 101.315 Current Score: 268 Avg Episode Length (SMA100): 535.4 Current Episode Length: 204.000 Landing Rate: 65% | Success Rate: 46%
Episode 2200 Avg Score (SMA100): 167.124 Current Score: 239 Avg Episode Length (SMA100): 375.89 Current Episode Length: 379.000 Landing Rate: 77% | Success Rate: 61%
Environment solved in 2312 episodes! Avg Score (SMA100): 201.737 Current Score: 237 Avg Episode Length (SMA100): 336.03 Current Episode Length: 223.000 Landing Rate: 86% | Success Rate: 71%
saveJSON(results_DDQN, 'results_DDQN.json')
show_video('DDQN')
# Plot result
results_DDQN = loadJSON('results_DDQN.json')
sns.set_style("whitegrid")
plotResult(results_DDQN, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 2312 Episodes
====================================
Final success_rate_SMA100: 71.00
Final landing_rate_SMA100: 86.00
Final scores_SMA100: 201.74
Observation:
We observed that the agent began to learn how to land on both feet around Episode 400 and successfully landed within the flags by Episode 420. The landing rate improved rapidly, outpacing the success rate around Episode 1300.
Compared to the DQN plots, the agent's training landing rate and success rate may appear unstable. However, a closer inspection reveals that the agent may be learning to land more efficiently: the landing rate dips around Episode 1300 while the success rate declines much less, suggesting the agent is focusing on landing efficiently rather than just effectively.
Actor-Critic is a popular reinforcement learning (RL) algorithm that combines both value-based and policy-based methods. The Actor refers to the policy network that maps the current state of the environment to an action, while the Critic is a value network that estimates the expected reward of a given state-action pair.
Proximal Policy Optimization (PPO) is an algorithm that can be used to improve the Actor-Critic algorithm by controlling the step size between consecutive policies in the optimization process. In traditional policy gradient algorithms, there is a risk of updating the policy too much, leading to a destabilization of the learning process. PPO addresses this issue by using a surrogate objective function that limits the step size between consecutive policies, allowing for more stable and efficient learning.
| Symbol | Meaning |
|---|---|
| $\pi_{\theta}$ | The policy represented by the parameter vector $\theta$ |
| $s_t$ | The state at time t |
| $A_t$ | The advantage estimate at time t |
| $R_t$ | The reward at time t |
| $\gamma$ | The discount factor, which determines the importance of future rewards relative to immediate rewards (0 < $\gamma$ $\leq$ 1) |
| $J(\theta)$ | The objective function to be optimized in PPO |
PPO is a reinforcement learning algorithm that improves upon the traditional policy gradient methods. PPO aims to stabilize the policy update process and avoid oscillations that can occur with traditional policy gradient methods.
The objective function in PPO is given by:
$J(\theta) = \mathbb{E}_{t}[\text{min}(r_t(\theta)\cdot A_t,\text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\cdot A_t)]$
where $r_t(\theta) = \dfrac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies, and $\epsilon$ is the clipping threshold.
The objective function is a combination of the surrogate objective, $r_t(\theta)\cdot A_t$, and the clipping function, $\text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\cdot A_t$. The surrogate objective encourages the improvement of the current policy, while the clipping function acts as a constraint that limits the magnitude of the policy update and helps stabilize the learning process.
PPO is an improvement over the traditional Q-function approach because it provides a more stable and effective way to update the policy, reducing the risk of oscillation and divergence. Additionally, PPO is computationally efficient and easier to implement compared to other reinforcement learning algorithms, making it a popular choice for real-world applications.
[References: Proximal Policy Optimization] [References: The Actor Critic Reinforcement Learning Algorithm]
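As a concrete check of the clipped objective above, this numpy sketch (hypothetical helper, not part of the notebook's code) evaluates $\text{min}(r_t(\theta)\cdot A_t, \text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\cdot A_t)$ for a few ratios:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

ratios = np.array([0.5, 1.0, 1.5])
adv_pos = ppo_clip_objective(ratios, np.array([1.0, 1.0, 1.0]))
adv_neg = ppo_clip_objective(ratios, np.array([-1.0, -1.0, -1.0]))
# positive advantage: gains beyond ratio 1.2 are clipped -> [0.5, 1.0, 1.2]
# negative advantage: min keeps the worse term     -> [-0.8, -1.0, -1.5]
```

Note the asymmetry: with a positive advantage the objective stops rewarding ratios past $1+\epsilon$, while with a negative advantage the `min` never hides a large penalty, so the policy cannot cheaply escape a bad update.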
class MLP(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim, dropout = 0.1):
super().__init__()
self.net = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.Dropout(dropout),
# PReLU -> Variant of LeakyReLU
nn.PReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.Dropout(dropout),
nn.PReLU(),
nn.Linear(hidden_dim, output_dim)
)
def forward(self, x):
x = self.net(x)
return x
class ActorCritic(nn.Module):
def __init__(self, actor, critic):
super().__init__()
self.actor = actor
self.critic = critic
def forward(self, state):
action_pred = self.actor(state)
value_pred = self.critic(state)
return action_pred, value_pred
def init_weights(m):
if type(m) == nn.Linear:
torch.nn.init.xavier_normal_(m.weight)
m.bias.data.fill_(0)
Returns are used to evaluate the quality of a policy and to provide a signal for updating the policy network. The return is the discounted sum of future rewards, and it provides information about how well the policy is performing. The policy network is updated to maximize the expected return, which is the sum of future rewards expected under the current policy. The optimization process adjusts the parameters of the policy network so that it predicts higher probabilities for actions that lead to higher returns.
def calculate_returns(rewards, discount_factor, normalize = True):
returns = []
R = 0
for r in reversed(rewards):
R = r + R * discount_factor
returns.insert(0, R)
returns = torch.tensor(returns)
if normalize:
returns = (returns - returns.mean()) / returns.std()
return returns
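The backward recursion above can be checked on a toy reward list. This plain-Python sketch mirrors `calculate_returns` with `normalize=False` (no torch needed; the function name is illustrative):

```python
# Plain-Python check of the discounted-return recursion:
# G_t = r_t + gamma * G_{t+1}, accumulated from the last reward backwards.
def discounted_returns(rewards, gamma):
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return returns

print(discounted_returns([1.0, 1.0, 1.0], 0.5))
# [1.75, 1.5, 1.0]: each entry is its reward plus gamma times the next return
```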
PPO advantages are used to adjust the probability of taking a particular action in a state. The policy network outputs a probability distribution over actions, and the advantages are used to adjust this distribution to favor actions that lead to higher rewards.
def calculate_advantages(returns, values, normalize = True):
advantages = returns - values
if normalize:
advantages = (advantages - advantages.mean()) / advantages.std()
return advantages
By using both advantages and returns, PPO balances the trade-off between exploration and exploitation. The policy network explores the environment by trying new actions, and it exploits the knowledge gained from past experiences by favoring actions that lead to higher rewards. Over time, the policy network learns to take actions that lead to higher returns, leading to an improvement in the overall quality of the policy. This technique is used to replace the traditional epsilon greedy method.
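Replacing epsilon-greedy with sampling from the softmax policy can be sketched as follows (a numpy stand-in for the `torch.distributions.Categorical` call used later; names and logits are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(logits):
    """Stochastic action selection: softmax the logits, then sample.
    Exploration comes from the distribution itself, not an epsilon schedule."""
    z = logits - logits.max()            # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(logits), p=probs), probs

action, probs = sample_action(np.array([2.0, 1.0, 0.1, 0.1]))
# probs ~ [0.60, 0.22, 0.09, 0.09]: the best-looking action dominates,
# but the others are still explored occasionally
```

As the policy becomes confident, its logits sharpen and exploration decays naturally, which is why no separate epsilon decay is needed.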
Policy update code
def update_policy(policy, states, actions, log_prob_actions, advantages, returns, optimizer, ppo_steps, ppo_clip):
states = states.detach()
actions = actions.detach()
log_prob_actions = log_prob_actions.detach()
advantages = advantages.detach()
returns = returns.detach()
for _ in range(ppo_steps):
#get new log prob of actions for all input states
action_pred, value_pred = policy(states)
value_pred = value_pred.squeeze(-1)
action_prob = F.softmax(action_pred, dim = -1)
dist = distributions.Categorical(action_prob)
#new log prob using old actions
new_log_prob_actions = dist.log_prob(actions)
policy_ratio = (new_log_prob_actions - log_prob_actions).exp()
policy_loss_1 = policy_ratio * advantages
policy_loss_2 = torch.clamp(policy_ratio, min = 1.0 - ppo_clip, max = 1.0 + ppo_clip) * advantages
policy_loss = - torch.min(policy_loss_1, policy_loss_2).mean().to(device)
value_loss = F.smooth_l1_loss(returns, value_pred).mean().to(device)
optimizer.zero_grad()
policy_loss.backward()
value_loss.backward()
optimizer.step()
def save_video_PPO(policy, file_name, model_ckpt= 'checkpoint_best.pth',render_mode="rgb_array", max_t = 1000, seed = 0):
env = gym.make('LunarLander-v2', enable_wind=True, render_mode=render_mode)
vid = video_recorder.VideoRecorder(env, path="video/{}.mp4".format(file_name))
policy.load_state_dict(torch.load('./models/' + model_ckpt))
state = env.reset(seed = seed)[0]
done = False
t = 0
rewards = 0
while not done and t != max_t:
t += 1
frame = env.render()
vid.capture_frame()
state = torch.FloatTensor(state).unsqueeze(0)
action_pred, _ = policy(state)
action_prob = F.softmax(action_pred, dim = -1)
dist = distributions.Categorical(action_prob)
action = dist.sample()
state, reward, done, _, _ = env.step(action.item())
rewards += reward
env.close()
return rewards
def train_policy(env,policy, optimizer,
discount_factor=0.99, ppo_steps=5, ppo_clip=0.2,
n_episodes=1000, max_t=1000, model_name='PPO_ActorCritic',
display_every=100):
np.random.seed(0)
# Put model to train
policy.train()
# Metrics variables
scores = [] # list containing scores from each episode
scores_SMA100 = []
scores_window = deque(maxlen=100) # last 100 scores
time_taken = []
time_taken_window = deque(maxlen=100)
success_rate = deque(maxlen=100)
landing_rate = deque(maxlen=100)
success_rate_SMA100 = []
landing_rate_SMA100 = []
for i_episode in tqdm(range(1, n_episodes+1)):
score = 0
state = env.reset()[0]
# Policy variables
states = []
actions = []
log_prob_actions = []
values = []
rewards = []
for t in range(max_t):
state = torch.FloatTensor(state).unsqueeze(0)
#append state here, not after we get the next state from env.step()
states.append(state)
action_pred, value_pred = policy(state)
action_prob = F.softmax(action_pred, dim = -1)
dist = distributions.Categorical(action_prob)
action = dist.sample()
log_prob_action = dist.log_prob(action)
state, reward, done, _, _ = env.step(action.item())
actions.append(action)
log_prob_actions.append(log_prob_action)
values.append(value_pred)
rewards.append(reward)
score += reward
if done:
if score >= 200:
success_rate.append(1)
landing_rate.append(1)
elif score >= 120:
success_rate.append(0)
landing_rate.append(1)
else:
success_rate.append(0)
landing_rate.append(0)
break
### Record Metrics ###
scores_window.append(score) # save most recent score
scores_SMA100.append(np.mean(scores_window))
scores.append(score) # save most recent score
time_taken.append(t)
time_taken_window.append(t)
landing_rate_SMA100.append(landing_rate.count(1))
success_rate_SMA100.append(success_rate.count(1))
states = torch.cat(states)
actions = torch.cat(actions)
log_prob_actions = torch.cat(log_prob_actions)
values = torch.cat(values).squeeze(-1)
returns = calculate_returns(rewards, discount_factor)
advantages = calculate_advantages(returns, values)
update_policy(policy, states, actions, log_prob_actions, advantages, returns, optimizer, ppo_steps, ppo_clip)
if i_episode % display_every == 0:
# SMA100: Average of past 100 period (Simple Moving Average)
print(f'\rEpisode {i_episode}\tAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window):.2f} Current Episode Length: {time_taken_window[-1]:.0f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
torch.save(policy.state_dict(), f'./models/{model_name + str(i_episode)}_train.pth')
if i_episode % (display_every*2) == 0:
save_video_PPO(policy, 'PPO', f'{model_name+str(i_episode)}_train.pth',max_t=max_t)
elif i_episode % (display_every*2+1) == 0:
show_video('PPO', 200)
if np.mean(scores_window)>=200.0:
print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
torch.save(policy.state_dict(), f'./models/{model_name}_best.pth')
break
return {
'scores': scores, 'scores_SMA100': scores_SMA100,
'scores_window': scores_window, 'time_taken': time_taken,
'time_taken_window': time_taken_window, 'success_rate': success_rate,
'landing_rate': landing_rate, 'landing_rate_SMA100': landing_rate_SMA100,
'success_rate_SMA100': success_rate_SMA100
}
Metrics Legend:
actor = MLP(8, 128, 4).to(device)
critic = MLP(8, 128, 1).to(device)
policy = ActorCritic(actor, critic)
policy.apply(init_weights)
optimizer = optim.Adam(policy.parameters(), lr = LR)
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
results_PPO = train_policy(env, policy, optimizer, 0.99, 5, 0.2, 2000, 1000,'PPO_Actor_Critic',300)
Episode 300 Avg Score (SMA100): -114.343 Current Score: -231 Avg Episode Length (SMA100): 409.63 Current Episode Length: 430 Landing Rate: 6% | Success Rate: 0%
Episode 600 Avg Score (SMA100): -3.890 Current Score: 33 Avg Episode Length (SMA100): 626.34 Current Episode Length: 193 Landing Rate: 5% | Success Rate: 1%
Episode 900 Avg Score (SMA100): 40.169 Current Score: -14 Avg Episode Length (SMA100): 773.39 Current Episode Length: 191 Landing Rate: 24% | Success Rate: 6%
Episode 1200 Avg Score (SMA100): 17.723 Current Score: -48 Avg Episode Length (SMA100): 737.13 Current Episode Length: 628 Landing Rate: 44% | Success Rate: 7%
Episode 1500 Avg Score (SMA100): 29.085 Current Score: 114 Avg Episode Length (SMA100): 615.67 Current Episode Length: 999 Landing Rate: 39% | Success Rate: 6%
Episode 1800 Avg Score (SMA100): 48.599 Current Score: 55 Avg Episode Length (SMA100): 812.39 Current Episode Length: 999 Landing Rate: 32% | Success Rate: 3%
actor = MLP(8, 128, 4).to(device)
critic = MLP(8, 128, 1).to(device)
policy = ActorCritic(actor, critic)
policy.apply(init_weights)
optimizer = optim.Adam(policy.parameters(), lr = LR)
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
results_PPO = train_policy(env, policy, optimizer, 0.99, 5, 0.2, 2000, 1000,'PPO_Actor_Critic',150)
Episode 150 Avg Score (SMA100): -276.004 Current Score: -35 Avg Episode Length (SMA100): 103.54 Current Episode Length: 91 Landing Rate: 0% | Success Rate: 0%
Episode 300 Avg Score (SMA100): -170.954 Current Score: -295 Avg Episode Length (SMA100): 412.22 Current Episode Length: 481 Landing Rate: 0% | Success Rate: 0%
Episode 450 Avg Score (SMA100): -67.057 Current Score: -3 Avg Episode Length (SMA100): 645.19 Current Episode Length: 300 Landing Rate: 2% | Success Rate: 0%
Episode 600 Avg Score (SMA100): 14.330 Current Score: 91 Avg Episode Length (SMA100): 733.44 Current Episode Length: 999 Landing Rate: 10% | Success Rate: 2%
Episode 750 Avg Score (SMA100): 51.255 Current Score: 98 Avg Episode Length (SMA100): 633.99 Current Episode Length: 999 Landing Rate: 23% | Success Rate: 4%
Episode 900 Avg Score (SMA100): 62.007 Current Score: 191 Avg Episode Length (SMA100): 729.35 Current Episode Length: 291 Landing Rate: 35% | Success Rate: 8%
Episode 1050 Avg Score (SMA100): -13.426 Current Score: 125 Avg Episode Length (SMA100): 660.98 Current Episode Length: 591 Landing Rate: 30% | Success Rate: 4%
Episode 1200 Avg Score (SMA100): -52.430 Current Score: -116 Avg Episode Length (SMA100): 626.09 Current Episode Length: 611 Landing Rate: 21% | Success Rate: 2%
Episode 1350 Avg Score (SMA100): -101.266 Current Score: -180 Avg Episode Length (SMA100): 644.47 Current Episode Length: 115 Landing Rate: 14% | Success Rate: 1%
Episode 1500 Avg Score (SMA100): -25.547 Current Score: -177 Avg Episode Length (SMA100): 604.42 Current Episode Length: 706 Landing Rate: 18% | Success Rate: 3%
Episode 1650 Avg Score (SMA100): -278.113 Current Score: -329 Avg Episode Length (SMA100): 683.42 Current Episode Length: 178 Landing Rate: 1% | Success Rate: 0%
Episode 1800 Avg Score (SMA100): -134.757 Current Score: -44 Avg Episode Length (SMA100): 758.98 Current Episode Length: 999 Landing Rate: 3% | Success Rate: 0%
Episode 1950 Avg Score (SMA100): 78.886 Current Score: 78 Avg Episode Length (SMA100): 789.49 Current Episode Length: 999 Landing Rate: 38% | Success Rate: 5%
saveJSON(results_PPO, 'results_PPO.json')
show_video('PPO')
# Plot result
results_PPO = loadJSON('results_PPO.json')
sns.set_style("whitegrid")
plotResult(results_PPO, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 2000 Episodes
====================================
Final success_rate_SMA100: 5.00
Final landing_rate_SMA100: 51.00
Final scores_SMA100: 75.81
Observation:
We observed that the agent began to learn how to land on both feet around Episode 400 and successfully landed within the flags by Episode 420. The landing rate improved rapidly, outpacing the success rate around Episode 500.
Compared to the other plots, the agent's training landing rate and success rate appear unstable, declining from around Episode 850 until Episode 1750, which might be attributed to the nature of PPO clipping. These results indicate that Actor-Critic PPO may not be the best choice here, particularly in an environment with only 4 discrete actions.
Prioritized Experience Replay (PER) is a modification to the traditional Experience Replay (ER) algorithm in Reinforcement Learning. The main difference between the two is the way experiences are stored and sampled from the replay buffer.
In ER, experiences are stored in a fixed-size buffer and randomly sampled from this buffer to train the agent. This can result in inefficient use of the experiences, as the agent may repeatedly sample low-impact experiences, while high-impact experiences are neglected.
In PER, experiences are assigned a priority value based on their estimated impact on the agent's learning. High-impact experiences are assigned a higher priority, and are therefore more likely to be sampled and used to update the agent's policy. This leads to more efficient use of experiences, as the agent focuses on learning from the most impactful experiences.
PER can result in improved performance compared to ER, as the agent is able to learn more effectively from the experiences that are most valuable to its learning process.
Resources: PER Original Implementation, 2015
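Proportional prioritization can be sketched in a few lines. In this hedged numpy example (the helper name is my own; `alpha=0.6` is a commonly used exponent and `eps` is a small constant that keeps zero-error transitions sampleable), TD errors are turned into sampling probabilities:

```python
import numpy as np

def priority_probs(td_errors, alpha=0.6, eps=1e-5):
    """Proportional prioritization: p_i = (|delta_i| + eps)^alpha, normalized.
    Larger TD errors -> higher probability of being replayed."""
    p = (np.abs(td_errors) + eps) ** alpha
    return p / p.sum()

rng = np.random.default_rng(0)
td_errors = np.array([0.1, 2.0, 0.05, 1.0])
probs = priority_probs(td_errors)
idx = rng.choice(len(td_errors), size=2, replace=False, p=probs)
# the transition with |delta| = 2.0 is replayed far more often than |delta| = 0.05
```

The exponent `alpha` interpolates between uniform replay (`alpha=0`) and pure greedy prioritization (`alpha=1`); the full algorithm also applies importance-sampling weights to correct the bias this introduces.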
Code for PER
class PriortizationReplayBuffer:
"""Fixed-size buffer to store experience tuples."""
def __init__(self, state_dim, action_dim, buffer_size, batch_size, priority=False):
"""Initialize a ReplayBuffer object.
Params
======
action_dim (int): dimension of each action
buffer_dim (int): maximum size of buffer (chosen as multiple of num agents)
batch_size (int): size of each training batch
seed (int): random seed
"""
self.states = torch.zeros((buffer_size,)+(state_dim,)).to(device)
self.next_states = torch.zeros((buffer_size,)+(state_dim,)).to(device)
self.actions = torch.zeros(buffer_size,1, dtype=torch.long).to(device)
self.rewards = torch.zeros(buffer_size, 1, dtype=torch.float).to(device)
self.dones = torch.zeros(buffer_size, 1, dtype=torch.float).to(device)
self.e = np.zeros((buffer_size, 1), dtype=np.float32)
self.priority = priority
self.ptr = 0
self.n = 0
self.buffer_size = buffer_size
self.batch_size = batch_size
def add(self, state, action, reward, next_state, done):
"""Add a new experience to memory."""
self.states[self.ptr] = torch.from_numpy(state).to(device)
self.next_states[self.ptr] = torch.from_numpy(next_state).to(device)
self.actions[self.ptr] = torch.from_numpy(np.asarray(action)).to(device)
self.rewards[self.ptr] = torch.from_numpy(np.asarray(reward)).to(device)
self.dones[self.ptr] = done
self.ptr += 1
if self.ptr >= self.buffer_size:
self.ptr = 0
self.n = self.buffer_size
def sample(self, get_all=False):
"""Randomly sample a batch of experiences from memory."""
n = len(self)
if get_all:
return self.states[:n], self.actions[:n], self.rewards[:n], self.next_states[:n], self.dones[:n]
if self.priority:
    # priorities must form a 1-D probability vector over the n stored entries;
    # fall back to uniform sampling until errors have been recorded
    p = self.e[:n].flatten()
    if p.sum() > 0:
        idx = np.random.choice(n, self.batch_size, replace=False, p=p / p.sum())
    else:
        idx = np.random.choice(n, self.batch_size, replace=False)
else:
    idx = np.random.choice(n, self.batch_size, replace=False)
states = self.states[idx]
next_states = self.next_states[idx]
actions = self.actions[idx]
rewards = self.rewards[idx]
dones = self.dones[idx]
return (states, actions, rewards, next_states, dones), idx
def update_error(self, e, idx=None):
e = torch.abs(e.detach())
e = e / e.sum()
if idx is not None:
self.e[idx] = e.cpu().numpy()
else:
self.e[:len(self)] = e.cpu().numpy()
def __len__(self):
if self.n == 0:
return self.ptr
else:
return self.n
Agent edited to incorporate PER
class PTRAgent:
"""Interacts with and learns from the environment."""
def __init__(self, state_size, action_size, hidden_dim,network, LR, weight_decay,priority=True):
"""Initialize an Agent object.
Params
======
state_size (int): dimension of each state
action_size (int): dimension of each action
seed (int): random seed
"""
self.state_size = state_size
self.action_size = action_size
self.qnetwork_online = network(state_size, action_size, hidden_dim).to(device)
self.qnetwork_target = network(state_size, action_size, hidden_dim).to(device)
self.optimizer = optim.Adam(self.qnetwork_online.parameters(), lr=LR, weight_decay=weight_decay)
# Replay memory
self.memory = PriortizationReplayBuffer(state_size, (action_size,), BUFFER_SIZE, BATCH_SIZE)
# Initialize time step (for updating every UPDATE_EVERY steps)
self.t_step = 0
def step(self, state, action, reward, next_state, done):
# Save experience in replay memory
self.memory.add(state, action, reward, next_state, done)
# Learn every UPDATE_EVERY time steps.
self.t_step = (self.t_step + 1) % UPDATE_EVERY
if self.t_step == 0:
# If enough samples are available in memory, get random subset and learn
if len(self.memory) > BATCH_SIZE:
experiences, idx = self.memory.sample()
e = self.learn(experiences)
self.memory.update_error(e, idx)
def act(self, state, eps=0.):
"""Returns actions for given state as per current policy.
Params
======
state (array_like): current state
eps (float): epsilon, for epsilon-greedy action selection
"""
state = torch.from_numpy(state).float().unsqueeze(0).to(device)
self.qnetwork_online.eval()
with torch.no_grad():
action_values = self.qnetwork_online(state)
self.qnetwork_online.train()
# Epsilon-greedy action selection
if random.random() > eps:
return np.argmax(action_values.cpu().data.numpy())
else:
return random.choice(np.arange(self.action_size))
def update_error(self):
states, actions, rewards, next_states, dones = self.memory.sample(get_all=True)
with torch.no_grad():
maxQ = self.qnetwork_target(next_states).max(-1, keepdim=True)[0]
target = rewards+GAMMA*maxQ*(1-dones)
old_val = self.qnetwork_online(states).gather(-1, actions)
e = old_val - target
self.memory.update_error(e)
def learn(self, experiences):
"""Update value parameters using given batch of experience tuples.
Params
======
experiences (Tuple[torch.Variable]): tuple of (s, a, r, s', done) tuples
gamma (float): discount factor
"""
states, actions, rewards, next_states, dones = experiences
## compute and minimize the loss
self.optimizer.zero_grad()
with torch.no_grad():
maxQ = self.qnetwork_target(next_states).max(-1, keepdim=True)[0]
target = rewards+GAMMA*maxQ*(1-dones)
old_val = self.qnetwork_online(states).gather(-1, actions)
loss = F.mse_loss(old_val, target)
loss.backward()
self.optimizer.step()
# ------------------- update target network ------------------- #
self.soft_update(self.qnetwork_online, self.qnetwork_target, TAU)
return old_val - target
def soft_update(self, local_model, target_model, tau):
"""Soft update model parameters.
θ_target = τ*θ_local + (1 - τ)*θ_target
Params
======
local_model (PyTorch model): weights will be copied from
target_model (PyTorch model): weights will be copied to
tau (float): interpolation parameter
"""
for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
target_param.data.copy_(tau*local_param.data + (1.0-tau)*target_param.data)
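The soft update is Polyak averaging: each target parameter moves a fraction τ of the way toward the corresponding online parameter on every learning step. The same rule on toy numbers (the values and τ here are illustrative, not the notebook's TAU):

```python
# Illustrative interpolation factor; the notebook's TAU constant is defined elsewhere
tau = 0.001

theta_target = [1.0, 2.0]  # current target-network parameters (toy values)
theta_local = [3.0, 4.0]   # current online-network parameters (toy values)

# theta_target <- tau*theta_local + (1 - tau)*theta_target, as in soft_update
theta_target = [tau * l + (1.0 - tau) * t for l, t in zip(theta_local, theta_target)]

print(theta_target)  # each entry is nudged slightly toward the online value
```

Because τ is small, the target network trails the online network smoothly, which stabilizes the bootstrapped TD targets.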
Metrics Legend:
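Throughout the training and testing loops below, an episode with a final score of at least 200 counts toward the success rate, and a score of at least 120 counts toward the landing rate. As a sketch, that repeated bookkeeping could be factored into one hypothetical helper:

```python
def classify_episode(score):
    """Return (landed, succeeded) flags using the thresholds from the training loops."""
    if score >= 200:
        return 1, 1   # landed with a solving score
    elif score >= 120:
        return 1, 0   # landed, but below the 200-point success bar
    return 0, 0       # crashed or flew away

print(classify_episode(250), classify_episode(150), classify_episode(-50))
```

Both rates are then reported as counts over the last 100 episodes (SMA100).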
# Reset seed - Use same seed for all experiments (objective comparison)
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
# DQN + PTR
agent = PTRAgent(8, 4, hidden_dim=64, network=QNetwork, LR=0.0005, weight_decay=0.0000001)
results_DQN_PTR = train_agent(display_every=200, max_t=1000, video_filepath='DQN_PTR')
Episode 200    Avg Score (SMA100): -210.796 Current Score: -184
Avg Episode Length (SMA100): 136.32 Current Episode Length: 206.000
Landing Rate: 0% | Success Rate: 0%

Episode 400    Avg Score (SMA100): -105.701 Current Score: -270
Avg Episode Length (SMA100): 445.82 Current Episode Length: 747.000
Landing Rate: 2% | Success Rate: 0%

[Moviepy rendering log trimmed - video saved to video/DQN_PTR.mp4]
Episode 600    Avg Score (SMA100): -96.389 Current Score: -34
Avg Episode Length (SMA100): 511.77 Current Episode Length: 999.000
Landing Rate: 23% | Success Rate: 8%

Episode 800    Avg Score (SMA100): -95.798 Current Score: -184
Avg Episode Length (SMA100): 621.54 Current Episode Length: 558.000
Landing Rate: 18% | Success Rate: 10%

[Moviepy rendering log trimmed - video saved to video/DQN_PTR.mp4]
Episode 1000    Avg Score (SMA100): -51.242 Current Score: -118
Avg Episode Length (SMA100): 835.1 Current Episode Length: 999.000
Landing Rate: 23% | Success Rate: 13%

Episode 1200    Avg Score (SMA100): 81.103 Current Score: 54
Avg Episode Length (SMA100): 457.67 Current Episode Length: 130.000
Landing Rate: 56% | Success Rate: 32%

[Moviepy rendering log trimmed - video saved to video/DQN_PTR.mp4]
Episode 1400    Avg Score (SMA100): 165.125 Current Score: 248
Avg Episode Length (SMA100): 369.12 Current Episode Length: 408.000
Landing Rate: 80% | Success Rate: 70%

Episode 1600    Avg Score (SMA100): 154.341 Current Score: 234
Avg Episode Length (SMA100): 345.35 Current Episode Length: 382.000
Landing Rate: 74% | Success Rate: 66%

[Moviepy rendering log trimmed - video saved to video/DQN_PTR.mp4]
Environment solved in 1761 episodes!
Avg Score (SMA100): 200.481 Current Score: 253
Avg Episode Length (SMA100): 357.63 Current Episode Length: 617.000
Landing Rate: 86% | Success Rate: 72%
saveJSON(results_DQN_PTR, 'dict_DQN_PTR.json')
show_video('DQN_PTR')
# Plot result
dict_DQN_PTR = loadJSON('dict_DQN_PTR.json')
sns.set_style("whitegrid")
plotResult(dict_DQN_PTR, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 1761 Episodes
====================================
Final success_rate_SMA100: 72.00
Final landing_rate_SMA100: 86.00
Final scores_SMA100: 200.48
Observation:
Our observations showed that the agent started to learn how to land on both feet around Episode 400 and eventually succeeded in landing within the flags several episodes later. Compared to prior models, this DQN + PER combination improved the landing rate and the success rate at a similar pace. The variance between these two rates across episodes was much lower, indicating that the agent learned both efficiently and simultaneously.
This network also solved the environment in the fewest episodes so far (1761).
Since our DQN + PER solved the environment in the fewest episodes of all the models we tried, we decided to tune its hyperparameters.
Parameters to tune:
*Note that there are many more possible hyperparameters to tune. Ultimately, our choice came down to the five hyperparameters that we believe play the biggest role in helping our agent train well.
def NetRandomTuner(LR_range=np.logspace(1, 2, num=4)/100000 * 0.5, max_alive=[600, 800, 1000],
model_layers=[32, 64, 128, 160], discount_factor=[0.985, 0.9875, 0.99, 0.9925, 0.995],
weight_decay=np.logspace(1, 4, num=4)/1000000000, trials=10):
global GAMMA
possible_trials=[]
for LR, MA, DF, ML, WD in itertools.product(*(LR_range,max_alive,discount_factor, model_layers,weight_decay)):
possible_trials.append([LR, MA, DF, ML, WD])
# shuffle all possible trials
random.shuffle(possible_trials)
epoch_hist = []
trial_hist = []
best_parms = [99999]
trial_count = 0
t0 = time.time()
for trial in possible_trials:
print('Next trial: ',trial)
t1 = time.time()
if trial_count == trials:
print(f'\n\nTrial ended at trial #{trial_count}')
break
trial_count += 1
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 0)
GAMMA = trial[2]
agent = PTRAgent(8, 4, hidden_dim=trial[3], LR=trial[0], weight_decay=trial[4], network=QNetwork)
results_DQN_PTR = train_agent(display_every=200, max_t=trial[1], n_episodes=1500)
if len(results_DQN_PTR['scores']) < best_parms[0]:
checkpoint_best = deepcopy(agent)
best_parms=[len(results_DQN_PTR['scores']),trial[0],trial[1],trial[2],trial[3],trial[4],trial_count]
best_results = results_DQN_PTR
clear_output()
print(f'''
Trial #{trial_count} Finished - Search Time {(time.time()-t1)/60:.2f} Mins
Total Time Elapsed: {(time.time()-t0)/60:.2f} Mins\n
Hyperparameters\t\t|Trial Values: #{trial_count}\t|Best Trial Values: #{best_parms[-1]}\n
Learning Rate\t\t|{trial[0]:.7f}\t\t|{best_parms[1]:.7f}
Max Time Alive\t\t|{trial[1]:.0f}\t\t\t|{best_parms[2]:.0f}
Discount Factor\t\t|{trial[2]:.4f}\t\t\t|{best_parms[3]:.4f}
Model Layers\t\t|{trial[3]:.0f}\t\t\t|{best_parms[4]:.0f}
Weight Decay\t\t|{trial[4]:.8f}\t\t|{best_parms[5]:.8f}
Epochs To Solve\t\t|{len(results_DQN_PTR['scores'])}\t\t\t|{best_parms[0]}\n\n
''')
return best_results, checkpoint_best
best_results, best_agent = NetRandomTuner(trials=30)
Trial #30 Finished - Search Time 7.85 Mins
Total Time Elapsed: 611.33 Mins

Hyperparameters    |Trial Values: #30    |Best Trial Values: #29

Learning Rate      |0.0005000            |0.0001077
Max Time Alive     |600                  |800
Discount Factor    |0.9925               |0.9875
Model Layers       |32                   |32
Weight Decay       |0.00000010           |0.00000100
Epochs To Solve    |915                  |840

Next trial: [0.00010772173450159416, 1000, 0.985, 64, 1e-08]

Trial ended at trial #30
saveJSON(best_results, 'DQN_PER_tuned.json')
# torch.save(best_agent.qnetwork_online.state_dict(), f'best_hypertuned_DQN_PER.pth')
torch.save(best_agent.qnetwork_target.state_dict(), f'best_hypertuned_DQN_PER_target.pth')
# Plot result
DQN_PER = loadJSON('DQN_PER_tuned.json')
sns.set_style("whitegrid")
plotResult(DQN_PER, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 840 Episodes
====================================
Final success_rate_SMA100: 71.00
Final landing_rate_SMA100: 97.00
Final scores_SMA100: 200.67
Observation:
After hyperparameter tuning, our model solved the environment in only 840 episodes, a significant improvement over the untuned model.
Our observations showed that the agent started to learn how to land on both feet and solve the environment at around Episode 200. Compared to our DQN + PER before hyperparameter tuning, the tuned model increased its landing rate at a very stable pace, meaning it learned to land very effectively. Ultimately, this model reached a landing rate (SMA100) of 97% and a success rate (SMA100) of 71%.
Note: Different seeds can produce different levels of difficulty for landing the lunar module within the flags. Some seeds yield an environment that is easier for the agent to learn, while others yield a more challenging one. Part of this drastic improvement may therefore be attributable to the environment seed.
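To illustrate the point about seeds: reseeding restores the exact same randomness stream, while a different seed produces a different one (and hence, in Gym, different terrain and wind). A pure-NumPy sketch of the seeding pattern used before each experiment:

```python
import numpy as np

def first_draw(seed):
    """Return the first uniform draw after reseeding, mimicking np.random.seed(...) before a run."""
    np.random.seed(seed)
    return np.random.rand()

a = first_draw(0)
b = first_draw(0)  # same seed: identical randomness stream
c = first_draw(2)  # different seed: a different stream, hence a different task instance

print(a == b, a != c)
```

This is why all our comparisons reset to the same seed before training, so that differences in results come from the models rather than the randomness.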
Lowest Epochs To Solve Environment?
[Figures: training curves for env seed 1 and env seed 2]
It is noteworthy that there are many other ways to objectively evaluate reinforcement learning depending on the user's needs; for example, time to solve the environment could also be used as an evaluation metric.
def train_agent_1k(max_t: int=1000, eps_start: float=1.0,
eps_end: float=0.01, eps_decay: float=0.995, display_every: int=150, model_name='DQN',
video_filepath='LunarLander_training'):
'''
Train a Network agent for a fixed 1000 episodes
Parameters:
max_t (int): Maximum number of timesteps per episode
eps_start (float): Initial value of epsilon for epsilon-greedy action selection
eps_end (float): Minimum value of epsilon
eps_decay (float): Factor to decrease epsilon per episode
'''
shown = False
scores = [] # list containing scores from each episode
scores_SMA100 = []
scores_window = deque(maxlen=100) # last 100 scores
eps = eps_start # initialize epsilon
time_taken = []
time_taken_window = deque(maxlen=100)
success_rate = deque(maxlen=100)
landing_rate = deque(maxlen=100)
success_rate_SMA100 = []
landing_rate_SMA100 = []
for i_episode in tqdm(range(1, 1001)):
state = env.reset()[0]
score = 0
for t in range(max_t):
action = agent.act(state, eps)
next_state, reward, done, _, _ = env.step(action)
agent.step(state, action, reward, next_state, done)
state = next_state
score += reward
if done:
if score >= 200:
success_rate.append(1)
landing_rate.append(1)
elif score >= 120:
success_rate.append(0)
landing_rate.append(1)
else:
success_rate.append(0)
landing_rate.append(0)
break
scores_window.append(score) # save most recent score
scores_SMA100.append(np.mean(scores_window))
scores.append(score) # save most recent score
time_taken.append(t)
time_taken_window.append(t)
landing_rate_SMA100.append(landing_rate.count(1))
success_rate_SMA100.append(success_rate.count(1))
eps = max(eps_end, eps_decay*eps) # decrease epsilon
if i_episode % display_every == 0:
# SMA100: Average of past 100 period (Simple Moving Average)
print(f'\rEpisode {i_episode}\tAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.3f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
torch.save(agent.qnetwork_online.state_dict(), f'./models/{model_name+str(i_episode)}_train.pth')
if i_episode % (display_every*2) == 0:
save_video(agent, video_filepath, f'{model_name+str(i_episode)}_train.pth', seed = i_episode)
elif i_episode % (display_every*2+1) == 0:
show_video(video_filepath, 200)
if np.mean(scores_window)>=200.0 and not shown:
shown = True
print('\nEnvironment solved in {:d} episodes!'.format(i_episode))
print(f'\rAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.3f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
torch.save(agent.qnetwork_online.state_dict(), f'./models/{model_name}_best.pth')
return {
'scores': scores, 'scores_SMA100': scores_SMA100,
'scores_window': scores_window, 'time_taken': time_taken,
'time_taken_window': time_taken_window, 'success_rate': success_rate,
'landing_rate': landing_rate, 'landing_rate_SMA100': landing_rate_SMA100,
'success_rate_SMA100': success_rate_SMA100, 'eps':eps
}
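The `eps` value of 0.01 that `train_agent_1k` reports after 1000 episodes follows directly from the decay schedule `eps = max(eps_end, eps_decay*eps)`: 0.995 compounded falls below the 0.01 floor after roughly 919 episodes, after which `max` clamps it. A quick check:

```python
# Defaults used by train_agent_1k
eps, eps_end, eps_decay = 1.0, 0.01, 0.995

for episode in range(1000):
    eps = max(eps_end, eps_decay * eps)

print(eps)  # 0.01 - the floor, matching the value returned by the training runs
```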
def train_policy_1k(env,policy, optimizer,
discount_factor=0.99, ppo_steps=5, ppo_clip=0.2,
max_t=1000, model_name='PPO_ActorCritic',display_every=100):
# Put model to train
policy.train()
# Metris variables
shown = False
scores = [] # list containing scores from each episode
scores_SMA100 = []
scores_window = deque(maxlen=100) # last 100 scores
time_taken = []
time_taken_window = deque(maxlen=100)
success_rate = deque(maxlen=100)
landing_rate = deque(maxlen=100)
success_rate_SMA100 = []
landing_rate_SMA100 = []
for i_episode in tqdm(range(1, 1001)):
score = 0
state = env.reset()[0]
# Policy variables
states = []
actions = []
log_prob_actions = []
values = []
rewards = []
for t in range(max_t):
state = torch.FloatTensor(state).unsqueeze(0)
#append state here, not after we get the next state from env.step()
states.append(state)
action_pred, value_pred = policy(state)
action_prob = F.softmax(action_pred, dim = -1)
dist = distributions.Categorical(action_prob)
action = dist.sample()
log_prob_action = dist.log_prob(action)
state, reward, done, _, _ = env.step(action.item())
actions.append(action)
log_prob_actions.append(log_prob_action)
values.append(value_pred)
rewards.append(reward)
score += reward
if done:
if score >= 200:
success_rate.append(1)
landing_rate.append(1)
elif score >= 120:
success_rate.append(0)
landing_rate.append(1)
else:
success_rate.append(0)
landing_rate.append(0)
break
### Record Metrics ###
scores_window.append(score) # save most recent score
scores_SMA100.append(np.mean(scores_window))
scores.append(score) # save most recent score
time_taken.append(t)
time_taken_window.append(t)
landing_rate_SMA100.append(landing_rate.count(1))
success_rate_SMA100.append(success_rate.count(1))
states = torch.cat(states)
actions = torch.cat(actions)
log_prob_actions = torch.cat(log_prob_actions)
values = torch.cat(values).squeeze(-1)
returns = calculate_returns(rewards, discount_factor)
advantages = calculate_advantages(returns, values)
update_policy(policy, states, actions, log_prob_actions, advantages, returns, optimizer, ppo_steps, ppo_clip)
if i_episode % display_every == 0:
# SMA100: Average of past 100 period (Simple Moving Average)
print(f'\rEpisode {i_episode}\tAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.0f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
torch.save(policy.state_dict(), f'./models/{model_name+str(i_episode)}_train.pth')
if i_episode % (display_every*2) == 0:
save_video_PPO(policy, 'PPO', f'{model_name+str(i_episode)}_train.pth',max_t=max_t, seed = i_episode)
elif i_episode % (display_every*2+1) == 0:
show_video('PPO', 200)
if np.mean(scores_window)>=200.0 and not shown:
shown = True
print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
torch.save(policy.state_dict(), f'./models/{model_name}_best.pth')
return {
'scores': scores, 'scores_SMA100': scores_SMA100,
'scores_window': scores_window, 'time_taken': time_taken,
'time_taken_window': time_taken_window, 'success_rate': success_rate,
'landing_rate': landing_rate, 'landing_rate_SMA100': landing_rate_SMA100,
'success_rate_SMA100': success_rate_SMA100
}
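`calculate_returns` and `calculate_advantages` are defined earlier in the notebook (outside this excerpt). For reference, the discounted return G_t = r_t + γ·G_{t+1} that such a helper typically accumulates can be sketched as below; the name `discounted_returns` is hypothetical, and the notebook's version may differ (for example, by also normalizing the returns):

```python
def discounted_returns(rewards, discount_factor):
    """Accumulate rewards backwards: G_t = r_t + gamma * G_{t+1}."""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + discount_factor * G
        returns.insert(0, G)
    return returns

rets = discounted_returns([1.0, 1.0, 1.0], 0.99)
print(rets)  # approximately [2.9701, 1.99, 1.0]
```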
def test_agent(max_t: int=1000,
eps: float=0.01, display_every: int=150, model_ckpt = 'DQN_1K_best.pth', agent=None):
'''
Test a trained Network agent
Parameters:
max_t (int): Maximum number of timesteps per episode
eps: Set to 0.01, the final epsilon value returned after training
'''
scores = [] # list containing scores from each episode
scores_SMA100 = []
scores_window = deque(maxlen=100) # last 100 scores
time_taken = []
time_taken_window = deque(maxlen=100)
success_rate = deque(maxlen=100)
landing_rate = deque(maxlen=100)
success_rate_SMA100 = []
landing_rate_SMA100 = []
agent.qnetwork_online.load_state_dict(torch.load('./models/' + model_ckpt))
agent.qnetwork_online.eval()
for i_episode in tqdm(range(1, 501)):
state = env.reset()[0]
score = 0
for t in range(max_t):
state = torch.from_numpy(state).float().unsqueeze(0).to(device)
with torch.no_grad():
action_values = agent.qnetwork_online(state)
if random.random() > eps:
action = np.argmax(action_values.cpu().data.numpy())
else:
action = random.choice(np.arange(4))
next_state, reward, done, _, _ = env.step(action)
state = next_state
score += reward
if done:
if score >= 200:
success_rate.append(1)
landing_rate.append(1)
elif score >= 120:
success_rate.append(0)
landing_rate.append(1)
else:
success_rate.append(0)
landing_rate.append(0)
break
scores_window.append(score) # save most recent score
scores_SMA100.append(np.mean(scores_window))
scores.append(score) # save most recent score
time_taken.append(t)
time_taken_window.append(t)
landing_rate_SMA100.append(landing_rate.count(1))
success_rate_SMA100.append(success_rate.count(1))
if i_episode % display_every == 0:
# SMA100: Average of past 100 period (Simple Moving Average)
print(f'\rEpisode {i_episode}\tAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.3f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
return {
'scores': scores, 'scores_SMA100': scores_SMA100,
'scores_window': scores_window, 'time_taken': time_taken,
'time_taken_window': time_taken_window, 'success_rate': success_rate,
'landing_rate': landing_rate, 'landing_rate_SMA100': landing_rate_SMA100,
'success_rate_SMA100': success_rate_SMA100
}
def test_policy(env, policy, max_t=1000,
display_every=100, model_ckpt=''):
# Load saved weights
policy.load_state_dict(torch.load('./models/' + model_ckpt + '.pth'))
# Put model to eval
policy.eval()
# Metrics variables
scores = [] # list containing scores from each episode
scores_window = deque(maxlen=100) # last 100 scores
time_taken = []
time_taken_window = deque(maxlen=100)
success_rate = deque(maxlen=100)
landing_rate = deque(maxlen=100)
success_rate_SMA100 = []
landing_rate_SMA100 = []
scores_SMA100 = []
for i_episode in tqdm(range(1, 500+1)):
score = 0
state = env.reset()[0]
for t in range(max_t):
state = torch.FloatTensor(state).unsqueeze(0)
action_pred, _ = policy(state)
action_prob = F.softmax(action_pred, dim = -1)
dist = distributions.Categorical(action_prob)
action = dist.sample()
state, reward, done, _, _ = env.step(action.item())
score += reward
if done:
if score >= 200:
success_rate.append(1)
landing_rate.append(1)
elif score >= 120:
success_rate.append(0)
landing_rate.append(1)
else:
success_rate.append(0)
landing_rate.append(0)
break
# Record metrics
scores_window.append(score) # save most recent score
scores_SMA100.append(np.mean(scores_window))
scores.append(score) # save most recent score
time_taken.append(t)
time_taken_window.append(t)
landing_rate_SMA100.append(landing_rate.count(1))
success_rate_SMA100.append(success_rate.count(1))
if i_episode % display_every == 0:
# SMA100: Average of past 100 period (Simple Moving Average)
print(f'\rEpisode {i_episode}\tAvg Score (SMA100): {np.mean(scores_window):.3f} Current Score: {scores_window[-1]:.0f}\nAvg Episode Length (SMA100): {np.mean(time_taken_window)} Current Episode Length: {time_taken_window[-1]:.0f}\nLanding Rate: {landing_rate.count(1):.0f}% | Success Rate: {success_rate.count(1):.0f}%\n')
return {
'scores': scores, 'scores_SMA100': scores_SMA100,
'scores_window': scores_window, 'time_taken': time_taken,
'time_taken_window': time_taken_window, 'success_rate': success_rate,
'landing_rate': landing_rate, 'landing_rate_SMA100': landing_rate_SMA100,
'success_rate_SMA100': success_rate_SMA100
}
Trained models (All 1000 episodes)
DQN - Training environment (seeds 0, 1)
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 1)
agent = Agent(8, 4, hidden_dim=64, network=QNetwork)
DQN_trained_1k = train_agent_1k(display_every=200, max_t=1000, model_name='DQN_1K')
DQN_trained_1k['eps']
Episode 200    Avg Score (SMA100): -186.146 Current Score: -160
Avg Episode Length (SMA100): 148.93 Current Episode Length: 168.000
Landing Rate: 0% | Success Rate: 0%

Episode 400    Avg Score (SMA100): -74.060 Current Score: -17
Avg Episode Length (SMA100): 684.3 Current Episode Length: 999.000
Landing Rate: 3% | Success Rate: 1%

[Moviepy rendering log trimmed - video saved to video/LunarLander_training.mp4]
Episode 600    Avg Score (SMA100): -46.550 Current Score: -182
Avg Episode Length (SMA100): 471.64 Current Episode Length: 118.000
Landing Rate: 30% | Success Rate: 7%

Episode 800    Avg Score (SMA100): -70.309 Current Score: -82
Avg Episode Length (SMA100): 934.42 Current Episode Length: 999.000
Landing Rate: 23% | Success Rate: 5%

[Moviepy rendering log trimmed - video saved to video/LunarLander_training.mp4]
Episode 1000    Avg Score (SMA100): 129.990 Current Score: 26
Avg Episode Length (SMA100): 542.53 Current Episode Length: 999.000
Landing Rate: 80% | Success Rate: 47%
0.01
saveJSON(DQN_trained_1k,'DQN_trained_1k.json')
DDQN - Training Environment (seeds 0, 1)
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 1)
agent = Agent(8, 4, hidden_dim=32, network=DDQN)
DDQN_trained_1k = train_agent_1k(display_every=200, max_t=1000, model_name='DDQN_1K')
DDQN_trained_1k['eps']
Episode 200    Avg Score (SMA100): -330.463 Current Score: -186
Avg Episode Length (SMA100): 109.86 Current Episode Length: 117.000
Landing Rate: 0% | Success Rate: 0%

Episode 400    Avg Score (SMA100): -138.476 Current Score: -39
Avg Episode Length (SMA100): 560.67 Current Episode Length: 999.000
Landing Rate: 0% | Success Rate: 0%

[Moviepy rendering log trimmed - video saved to video/LunarLander_training.mp4]
Episode 600    Avg Score (SMA100): -147.844 Current Score: -144
Avg Episode Length (SMA100): 336.39 Current Episode Length: 147.000
Landing Rate: 1% | Success Rate: 0%

Episode 800    Avg Score (SMA100): -108.500 Current Score: -57
Avg Episode Length (SMA100): 440.15 Current Episode Length: 999.000
Landing Rate: 5% | Success Rate: 0%

[Moviepy rendering log trimmed - video saved to video/LunarLander_training.mp4]
Episode 1000    Avg Score (SMA100): -169.918 Current Score: -28
Avg Episode Length (SMA100): 611.61 Current Episode Length: 999.000
Landing Rate: 0% | Success Rate: 0%
0.01
PPO - Training Environment
actor = MLP(8, 128, 4).to(device)
critic = MLP(8, 128, 1).to(device)
policy = ActorCritic(actor, critic)
policy.apply(init_weights)
optimizer = optim.Adam(policy.parameters(), lr = 0.0005)
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 1)
PPO_trained_1k = train_policy_1k(env, policy, optimizer, 0.99, 5, 0.2, 1000,'PPO_trained_1k',200)
Episode 200    Avg Score (SMA100): -147.075 Current Score: -71
Avg Episode Length (SMA100): 110.81 Current Episode Length: 194
Landing Rate: 0% | Success Rate: 0%

Episode 400    Avg Score (SMA100): -108.386 Current Score: -82
Avg Episode Length (SMA100): 616.46 Current Episode Length: 706
Landing Rate: 0% | Success Rate: 0%

[Moviepy rendering log trimmed - video saved to video/PPO.mp4]
Episode 600    Avg Score (SMA100): -12.633 Current Score: -10
Avg Episode Length (SMA100): 656.43 Current Episode Length: 328
Landing Rate: 11% | Success Rate: 2%

Episode 800    Avg Score (SMA100): 40.668 Current Score: -77
Avg Episode Length (SMA100): 757.44 Current Episode Length: 503
Landing Rate: 28% | Success Rate: 5%

[Moviepy rendering log trimmed - video saved to video/PPO.mp4]
Episode 1000    Avg Score (SMA100): 45.145 Current Score: 86
Avg Episode Length (SMA100): 683.03 Current Episode Length: 785
Landing Rate: 24% | Success Rate: 6%
DQN + PER - Training Environment
GAMMA = 0.9875
np.random.seed(0)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 1)
agent = PTRAgent(8, 4, hidden_dim=32, LR=0.0001077, weight_decay=0.000001, network=QNetwork)
DQN_PER_trained_1k = train_agent_1k(display_every=200, max_t=800,model_name='DQN_PER_1k')
DQN_PER_trained_1k['eps']
Episode 200    Avg Score (SMA100): -111.753 Current Score: -204
Avg Episode Length (SMA100): 290.27 Current Episode Length: 534.000
Landing Rate: 0% | Success Rate: 0%

Episode 400    Avg Score (SMA100): -74.284 Current Score: -39
Avg Episode Length (SMA100): 632.84 Current Episode Length: 799.000
Landing Rate: 0% | Success Rate: 0%

[Moviepy rendering log trimmed - video saved to video/LunarLander_training.mp4]
Episode 600 Avg Score (SMA100): 22.136 Current Score: 134 Avg Episode Length (SMA100): 695.06 Current Episode Length: 799.000 Landing Rate: 24% | Success Rate: 10%
Episode 800 Avg Score (SMA100): 119.191 Current Score: 268 Avg Episode Length (SMA100): 632.84 Current Episode Length: 799.000 Landing Rate: 68% | Success Rate: 58%
Moviepy - Building video video/LunarLander_training.mp4.
Moviepy - Writing video video/LunarLander_training.mp4
Moviepy - Done ! Moviepy - video ready video/LunarLander_training.mp4
Episode 1000 Avg Score (SMA100): 187.603 Current Score: 160 Avg Episode Length (SMA100): 391.93 Current Episode Length: 451.000 Landing Rate: 86% | Success Rate: 71%
0.01
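The 0.01 printed above is the final epsilon of the epsilon-greedy exploration schedule. A common multiplicative-decay schedule with a floor is sketched below; the exact `eps_start`/`eps_decay` values the training loop used are assumptions here, not taken from the project's code.

```python
def epsilon_schedule(n_episodes, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Per-episode epsilon values: multiply by eps_decay each episode,
    never dropping below the eps_end floor."""
    eps = eps_start
    out = []
    for _ in range(n_episodes):
        out.append(eps)
        eps = max(eps_end, eps * eps_decay)
    return out
```

With these defaults, epsilon reaches the 0.01 floor well before episode 1000, so late training is almost entirely exploitation.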
DQN - Testing Environment
np.random.seed(2)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 2)
agent_dqn = Agent(8, 4, hidden_dim=64, network=QNetwork)
DQN_test_results = test_agent(1000, model_ckpt = 'DQN_1K_best.pth', display_every=100,agent = agent_dqn)
Episode 100 Avg Score (SMA100): 113.749 Current Score: 134 Avg Episode Length (SMA100): 459.14 Current Episode Length: 756.000 Landing Rate: 61% | Success Rate: 49%
Episode 200 Avg Score (SMA100): 103.495 Current Score: 226 Avg Episode Length (SMA100): 457.06 Current Episode Length: 609.000 Landing Rate: 68% | Success Rate: 44%
Episode 300 Avg Score (SMA100): 123.541 Current Score: 242 Avg Episode Length (SMA100): 492.69 Current Episode Length: 650.000 Landing Rate: 70% | Success Rate: 50%
Episode 400 Avg Score (SMA100): 100.395 Current Score: 197 Avg Episode Length (SMA100): 507.05 Current Episode Length: 634.000 Landing Rate: 68% | Success Rate: 48%
Episode 500 Avg Score (SMA100): 118.841 Current Score: 211 Avg Episode Length (SMA100): 473.13 Current Episode Length: 445.000 Landing Rate: 72% | Success Rate: 53%
saveJSON(DQN_test_results, 'DQN_test.json')
DDQN - Testing Environment
np.random.seed(2)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 2)
agent_ddqn = Agent(8, 4, hidden_dim=32, network=DDQN)
DDQN_test_results = test_agent(1000, model_ckpt = 'DDQN_1K_best.pth', display_every=100, agent=agent_ddqn)
Episode 100 Avg Score (SMA100): -191.774 Current Score: -257 Avg Episode Length (SMA100): 650.76 Current Episode Length: 812.000 Landing Rate: 0% | Success Rate: 0%
Episode 200 Avg Score (SMA100): -190.667 Current Score: -113 Avg Episode Length (SMA100): 656.98 Current Episode Length: 247.000 Landing Rate: 0% | Success Rate: 0%
Episode 300 Avg Score (SMA100): -178.545 Current Score: -276 Avg Episode Length (SMA100): 631.19 Current Episode Length: 466.000 Landing Rate: 2% | Success Rate: 1%
Episode 400 Avg Score (SMA100): -189.764 Current Score: -255 Avg Episode Length (SMA100): 655.66 Current Episode Length: 423.000 Landing Rate: 1% | Success Rate: 0%
Episode 500 Avg Score (SMA100): -189.797 Current Score: -217 Avg Episode Length (SMA100): 670.35 Current Episode Length: 545.000 Landing Rate: 0% | Success Rate: 0%
saveJSON(DDQN_test_results, 'DDQN_test.json')
PPO - Testing Environment
np.random.seed(2)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 2)
actor = MLP(8, 128, 4).to(device)
critic = MLP(8, 128, 1).to(device)
policy = ActorCritic(actor, critic)
policy.apply(init_weights)
PPO_test_results = test_policy(env, policy, max_t=1000, display_every=100, model_ckpt='PPO_trained_1k1000_train')
Episode 100 Avg Score (SMA100): -11.810 Current Score: -97 Avg Episode Length (SMA100): 743.14 Current Episode Length: 211 Landing Rate: 16% | Success Rate: 2%
Episode 200 Avg Score (SMA100): 28.305 Current Score: 116 Avg Episode Length (SMA100): 793.0 Current Episode Length: 761 Landing Rate: 26% | Success Rate: 1%
Episode 300 Avg Score (SMA100): 7.690 Current Score: 13 Avg Episode Length (SMA100): 771.94 Current Episode Length: 999 Landing Rate: 25% | Success Rate: 0%
Episode 400 Avg Score (SMA100): -19.904 Current Score: 80 Avg Episode Length (SMA100): 708.16 Current Episode Length: 999 Landing Rate: 20% | Success Rate: 0%
Episode 500 Avg Score (SMA100): -6.674 Current Score: 107 Avg Episode Length (SMA100): 719.03 Current Episode Length: 790 Landing Rate: 14% | Success Rate: 1%
saveJSON(PPO_test_results, 'PPO_test.json')
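The PPO policy tested above is trained with the clipped surrogate objective. A minimal per-transition sketch of that loss is shown below, assuming the probability ratio and an advantage estimate are already computed; this is a simplification of the project's actual training code.

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate loss for a single transition.

    ratio = pi_new(a|s) / pi_old(a|s); the clip keeps the policy
    update within a trust region of width eps around 1."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    # Negated because optimizers minimize; PPO maximizes the surrogate.
    return -min(unclipped, clipped)
```

Taking the minimum of the clipped and unclipped terms makes the objective pessimistic, so large policy ratios yield no extra reward and updates stay conservative.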
DQN + PER - Testing Environment
# Reset seed - Use same seed for all experiments (objective comparison)
np.random.seed(2)
env = gym.make('LunarLander-v2',enable_wind=True)
env.reset(seed = 2)
agent_dqn_per = Agent(8, 4, hidden_dim=32, network=QNetwork)
DQN_PER_test_results = test_agent(1000, model_ckpt = 'DQN_PER_1k_best.pth', display_every=100, agent=agent_dqn_per)
Episode 100 Avg Score (SMA100): 198.433 Current Score: 229 Avg Episode Length (SMA100): 437.23 Current Episode Length: 338.000 Landing Rate: 87% | Success Rate: 69%
Episode 200 Avg Score (SMA100): 188.209 Current Score: -32 Avg Episode Length (SMA100): 429.25 Current Episode Length: 440.000 Landing Rate: 86% | Success Rate: 70%
Episode 300 Avg Score (SMA100): 190.582 Current Score: 45 Avg Episode Length (SMA100): 484.18 Current Episode Length: 167.000 Landing Rate: 90% | Success Rate: 69%
Episode 400 Avg Score (SMA100): 182.821 Current Score: 263 Avg Episode Length (SMA100): 451.55 Current Episode Length: 412.000 Landing Rate: 88% | Success Rate: 73%
Episode 500 Avg Score (SMA100): 175.023 Current Score: 214 Avg Episode Length (SMA100): 449.08 Current Episode Length: 367.000 Landing Rate: 84% | Success Rate: 60%
saveJSON(DQN_PER_test_results, 'DQN_PER_test.json')
DQN - Test Results
# Plot result
DQN_test = loadJSON('DQN_test.json')
sns.set_style("whitegrid")
plotResult(DQN_test, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 500 Episodes
====================================
Final success_rate_SMA100: 53.00
Final landing_rate_SMA100: 72.00
Final scores_SMA100: 118.84
Observations
Over the 500 test episodes, the standard DQN achieved an average landing rate (SMA100) of 61.73% and an average success rate (SMA100) of 43.95%, along with an average score (SMA100) of about 110.
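The SMA100 metrics reported throughout are 100-episode simple moving averages. A minimal helper reproducing that smoothing is sketched below; the function name is hypothetical and not taken from the project's code.

```python
from collections import deque

def sma(values, window=100):
    """Simple moving average over the most recent `window` values.

    Before `window` values have accumulated, the average is taken
    over all values seen so far (matching early-episode logs)."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out
```

Applied to per-episode scores, landing flags, or success flags, this yields the `scores_SMA100`, `landing_rate_SMA100`, and `success_rate_SMA100` curves plotted above.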
DDQN - Test Results
# Plot result
DDQN_test = loadJSON('DDQN_test.json')
sns.set_style("whitegrid")
plotResult(DDQN_test, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 500 Episodes
====================================
Final success_rate_SMA100: 0.00
Final landing_rate_SMA100: 0.00
Final scores_SMA100: -189.80
Observations
Over the 500 test episodes, the DDQN achieved a highest landing rate (SMA100) of only 2% and a highest success rate (SMA100) of 1%, with an average score (SMA100) of about -188. These results are extremely poor compared to the standard DQN.
PPO - Test Results
# Plot result
PPO_test = loadJSON('PPO_test.json')
sns.set_style("whitegrid")
plotResult(PPO_test, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 500 Episodes
====================================
Final success_rate_SMA100: 1.00
Final landing_rate_SMA100: 14.00
Final scores_SMA100: -6.67
Observations
Over the 500 test episodes, the PPO algorithm achieved an average landing rate (SMA100) of 18% and an average success rate (SMA100) of 0.8%, with an average score (SMA100) of -2.33. Although slightly better than the DDQN, these results are still far worse than the standard DQN's.
DQN + PER - Test Results
# Plot result
DQN_PER_test = loadJSON('DQN_PER_test.json')
sns.set_style("whitegrid")
plotResult(DQN_PER_test, [['success_rate_SMA100', 'landing_rate_SMA100'], ['scores_SMA100']])
Past 500 Episodes
====================================
Final success_rate_SMA100: 60.00
Final landing_rate_SMA100: 84.00
Final scores_SMA100: 175.02
Observations
Over the 500 test episodes, the optimized DQN + PER algorithm achieved an average landing rate (SMA100) of 79.2% and an average success rate (SMA100) of 63%, with an average score (SMA100) of about 190. Of all the algorithms evaluated, these results are by far the best.
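The prioritized replay that drives this improvement samples transitions in proportion to their TD error rather than uniformly. A minimal proportional-prioritization sketch is shown below; it is a simplification of the buffer used by the project's `PTRAgent` (a real implementation would use a sum-tree for O(log n) sampling and importance-sampling weights to correct the bias).

```python
import random

class PrioritizedReplayBuffer:
    """Minimal proportional PER sketch (after Schaul et al., 2016)."""

    def __init__(self, capacity=10000, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []
        self.pos = 0  # next slot to overwrite once full

    def add(self, transition):
        # New transitions get the current max priority so they are
        # guaranteed to be replayed at least once.
        max_p = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(max_p)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = max_p
            self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Sampling probability proportional to priority^alpha.
        scaled = [p ** self.alpha for p in self.priorities]
        idxs = random.choices(range(len(self.data)), weights=scaled, k=batch_size)
        return idxs, [self.data[i] for i in idxs]

    def update_priorities(self, idxs, td_errors):
        # eps keeps every transition sampleable even at zero TD error.
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(err) + self.eps
```

After each learning step, the agent calls `update_priorities` with the batch's new TD errors, so surprising transitions are revisited more often.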
Displaying videos for each model.
reward_dqn = save_video(agent_dqn, 'DQN_test', 'DQN_1K_best.pth', 1000, 0)
reward_dqn
Moviepy - Building video video/DQN_test.mp4. Moviepy - Writing video video/DQN_test.mp4
Moviepy - Done ! Moviepy - video ready video/DQN_test.mp4
163.0976443844678
reward_ddqn = save_video(agent_ddqn, 'DDQN_test', 'DDQN_1K_best.pth', 1000, 0)
reward_ddqn
Moviepy - Building video video/DDQN_test.mp4. Moviepy - Writing video video/DDQN_test.mp4
Moviepy - Done ! Moviepy - video ready video/DDQN_test.mp4
-184.18056191070494
reward_ppo = save_video_PPO(policy, 'PPO_test', 'PPO_trained_1k1000_train.pth', 1000, seed = 0)
reward_ppo
Moviepy - Building video video/PPO_test.mp4. Moviepy - Writing video video/PPO_test.mp4
Moviepy - Done ! Moviepy - video ready video/PPO_test.mp4
72.27011975068002
reward_dqn_per = save_video(agent_dqn_per, 'DQN_PER_test', 'DQN_PER_1k_best.pth', 1000, 0)
reward_dqn_per
Moviepy - Building video video/DQN_PER_test.mp4. Moviepy - Writing video video/DQN_PER_test.mp4
Moviepy - Done ! Moviepy - video ready video/DQN_PER_test.mp4
305.26111013472416
Embedding mp4 files into html ⚙️
filepaths = ['DQN_test.mp4','DDQN_test.mp4','PPO_test.mp4','DQN_PER_test.mp4']
rewards = [reward_dqn, reward_ddqn, reward_ppo, reward_dqn_per]
grid_html = '''
<style>
.video-grid {{
display: table;
width: 100%;
}}
.video-row {{
display: table-row;
}}
.video-item {{
display: table-cell;
width: 50%;
height: 400px;
text-align: center;
vertical-align: middle;
}}
</style>
<div class="video-grid">
{}
</div>
'''
video_html = '''
<div class="video-item">
<h3>{}</h3>
<video alt="test" autoplay loop controls width="80%">
<source src="data:video/mp4;base64,{}" type="video/mp4" />
</video>
</div>
'''
videos = ''
for i, file_name in enumerate(filepaths):
    mp4 = 'video/{}'.format(file_name)
    with open(mp4, 'rb') as f:  # read-only binary is sufficient here
        encoded = base64.b64encode(f.read())
    video_item = video_html.format(f'{file_name} | reward: {rewards[i]:.3f}', encoded.decode('ascii'))
    if i % 2 == 0:
        videos += '<div class="video-row">'
    videos += video_item
    if i % 2 == 1:
        videos += '</div>'
display.display(display.HTML(grid_html.format(videos)))
In the same environment:
The DDQN failed to land the spaceship.
The PPO managed to land the spaceship, but not between the flags.
The DQN landed the spaceship between the flags (a success), but took 19 seconds, which is inefficient.
Lastly, the DQN + PER landed the spaceship successfully in only 5 seconds, outperforming all the other models.
In this project, we improved our reinforcement learning implementation by constructing and testing several networks available for reinforcement learning. After experimenting with DQN, DDQN, and Actor-Critic with PPO, we found that the best-performing solution was a DQN with a Prioritized Replay Buffer. To optimize this model further, we conducted hyperparameter tuning on five key variables. Finally, to evaluate the models objectively, we used the same training environment (seed: 1) for all networks and tested them in a separate testing environment (seed: 2).
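The hyperparameter tuning mentioned above can be sketched as a simple random search over the tuned variables. The search space below is hypothetical: it names five plausible knobs (matching values seen in the cells above, e.g. `hidden_dim=32`, `GAMMA=0.9875`), but the actual ranges and tuning procedure used in the project are not shown here.

```python
import random

random.seed(0)

# Hypothetical search space for the five tuned variables.
SEARCH_SPACE = {
    'hidden_dim':   [32, 64, 128],
    'lr':           [1e-4, 3e-4, 1e-3],
    'gamma':        [0.98, 0.9875, 0.99],
    'alpha':        [0.4, 0.6, 0.8],     # PER priority exponent
    'weight_decay': [0.0, 1e-6, 1e-5],
}

def sample_config(space):
    """Draw one random configuration from the search space."""
    return {k: random.choice(v) for k, v in space.items()}

# Each sampled config would be used to train an agent, with the best
# checkpoint kept according to the SMA100 score on the training seed.
configs = [sample_config(SEARCH_SPACE) for _ in range(20)]
```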